2025-05-07T20:22:34.9267335Z Current runner version: '2.323.0'
2025-05-07T20:22:34.9276919Z Runner name: 'i-0e49e9d70b38203df'
2025-05-07T20:22:34.9278372Z Machine name: 'ip-10-0-29-135'
2025-05-07T20:22:34.9282608Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:22:34.9285637Z Contents: read
2025-05-07T20:22:34.9286403Z Metadata: read
2025-05-07T20:22:34.9287151Z Packages: read
2025-05-07T20:22:34.9288003Z ##[endgroup]
2025-05-07T20:22:34.9291070Z Secret source: None
2025-05-07T20:22:34.9292026Z Prepare workflow directory
2025-05-07T20:22:34.9841449Z Prepare all required actions
2025-05-07T20:22:34.9877411Z Getting action download info
2025-05-07T20:22:35.1777168Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:22:35.5006277Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:22:35.8679630Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:22:37.4667207Z Getting action download info
2025-05-07T20:22:37.5527012Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:22:37.7809898Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.11, 12.6.3, 12.6.3, clang)
2025-05-07T20:22:37.8421980Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:22:37.8555950Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:22:37.8568954Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:22:37.8570490Z ##[endgroup]
2025-05-07T20:22:39.0034409Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:22:39.0035091Z Instance Type: g5.4xlarge
2025-05-07T20:22:39.0035559Z AMI Name: unknown
2025-05-07T20:22:39.0071831Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:22:44.3782408Z ##[group]Run actions/checkout@v4
2025-05-07T20:22:44.3782720Z with:
2025-05-07T20:22:44.3782944Z   submodules: true
2025-05-07T20:22:44.3783190Z   repository: pytorch/FBGEMM
2025-05-07T20:22:44.3783589Z   token: ***
2025-05-07T20:22:44.3783790Z   ssh-strict: true
2025-05-07T20:22:44.3783999Z   ssh-user: git
2025-05-07T20:22:44.3784218Z   persist-credentials: true
2025-05-07T20:22:44.3784460Z   clean: true
2025-05-07T20:22:44.3784683Z   sparse-checkout-cone-mode: true
2025-05-07T20:22:44.3784946Z   fetch-depth: 1
2025-05-07T20:22:44.3785160Z   fetch-tags: false
2025-05-07T20:22:44.3785379Z   show-progress: true
2025-05-07T20:22:44.3785600Z   lfs: false
2025-05-07T20:22:44.3785802Z   set-safe-directory: true
2025-05-07T20:22:44.3786052Z env:
2025-05-07T20:22:44.3786262Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:44.3786566Z   BUILD_ENV: build_binary
2025-05-07T20:22:44.3786818Z   BUILD_TARGET: genai
2025-05-07T20:22:44.3787036Z   BUILD_VARIANT: cuda
2025-05-07T20:22:44.3787307Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:44.3787563Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:44.3787813Z ##[endgroup]
2025-05-07T20:22:44.4948015Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:22:44.4949207Z ##[group]Getting Git version info
2025-05-07T20:22:44.4949652Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.4950264Z [command]/usr/bin/git version
2025-05-07T20:22:44.4950538Z git version 2.47.1
2025-05-07T20:22:44.4962095Z ##[endgroup]
2025-05-07T20:22:44.4976567Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/854af8bc-58eb-438a-bc80-ef8ee6ced871' before making global git config changes
2025-05-07T20:22:44.4977461Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:22:44.4991043Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.5029142Z Deleting the contents of '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.5032072Z ##[group]Initializing the repository
2025-05-07T20:22:44.5036368Z [command]/usr/bin/git init /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.5079760Z hint: Using 'master' as the name for the initial branch. This default branch name
2025-05-07T20:22:44.5080989Z hint: is subject to change. To configure the initial branch name to use in all
2025-05-07T20:22:44.5082063Z hint: of your new repositories, which will suppress this warning, call:
2025-05-07T20:22:44.5082825Z hint:
2025-05-07T20:22:44.5083423Z hint:   git config --global init.defaultBranch <name>
2025-05-07T20:22:44.5084082Z hint:
2025-05-07T20:22:44.5084736Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
2025-05-07T20:22:44.5085477Z hint: 'development'. The just-created branch can be renamed via this command:
2025-05-07T20:22:44.5085891Z hint:
2025-05-07T20:22:44.5086110Z hint:   git branch -m <name>
2025-05-07T20:22:44.5086586Z Initialized empty Git repository in /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/
2025-05-07T20:22:44.5091130Z [command]/usr/bin/git remote add origin https://github.com/pytorch/FBGEMM
2025-05-07T20:22:44.5125536Z ##[endgroup]
2025-05-07T20:22:44.5126041Z ##[group]Disabling automatic garbage collection
2025-05-07T20:22:44.5129776Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:22:44.5161214Z ##[endgroup]
2025-05-07T20:22:44.5161603Z ##[group]Setting up auth
2025-05-07T20:22:44.5167823Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:22:44.5200191Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:22:44.5564806Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:22:44.5596099Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:22:44.5939748Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:44.5988067Z ##[endgroup]
2025-05-07T20:22:44.5988508Z ##[group]Fetching the repository
2025-05-07T20:22:44.5995930Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:22:45.3963908Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:22:45.3964828Z  * [new ref]  a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:22:45.3988085Z ##[endgroup]
2025-05-07T20:22:45.3988486Z ##[group]Determining the checkout info
2025-05-07T20:22:45.3991015Z ##[endgroup]
2025-05-07T20:22:45.4005931Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:22:45.4043719Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:22:45.4087578Z ##[group]Checking out the ref
2025-05-07T20:22:45.4091314Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:22:45.5185971Z Note: switching to 'refs/remotes/pull/4066/merge'.
2025-05-07T20:22:45.5186236Z
2025-05-07T20:22:45.5187273Z You are in 'detached HEAD' state. You can look around, make experimental
2025-05-07T20:22:45.5187816Z changes and commit them, and you can discard any commits you make in this
2025-05-07T20:22:45.5188330Z state without impacting any branches by switching back to a branch.
2025-05-07T20:22:45.5188639Z
2025-05-07T20:22:45.5188859Z If you want to create a new branch to retain commits you create, you may
2025-05-07T20:22:45.5189330Z do so (now or later) by using -c with the switch command. Example:
2025-05-07T20:22:45.5189617Z
2025-05-07T20:22:45.5189732Z   git switch -c <new-branch-name>
2025-05-07T20:22:45.5189925Z
2025-05-07T20:22:45.5190054Z Or undo this operation with:
2025-05-07T20:22:45.5190226Z
2025-05-07T20:22:45.5190321Z   git switch -
2025-05-07T20:22:45.5191089Z
2025-05-07T20:22:45.5191316Z Turn off this advice by setting config variable advice.detachedHead to false
2025-05-07T20:22:45.5191645Z
2025-05-07T20:22:45.5192028Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:22:45.5199149Z ##[endgroup]
2025-05-07T20:22:45.5199556Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:22:45.5204606Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:45.5252542Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:22:45.5285157Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:22:45.5317754Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:22:45.5345622Z ##[endgroup]
2025-05-07T20:22:45.5345998Z ##[group]Fetching submodules
2025-05-07T20:22:45.5348891Z [command]/usr/bin/git submodule sync
2025-05-07T20:22:45.5690586Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:22:45.6011832Z Submodule 'external/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'external/asmjit'
2025-05-07T20:22:45.6013814Z Submodule 'external/composable_kernel' (https://github.com/jwfromm/composable_kernel.git) registered for path 'external/composable_kernel'
2025-05-07T20:22:45.6018115Z Submodule 'external/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'external/cpuinfo'
2025-05-07T20:22:45.6022187Z Submodule 'external/cutlass' (https://github.com/jwfromm/cutlass) registered for path 'external/cutlass'
2025-05-07T20:22:45.6026457Z Submodule 'external/googletest' (https://github.com/google/googletest) registered for path 'external/googletest'
2025-05-07T20:22:45.6031068Z Submodule 'external/hipify_torch' (https://github.com/ROCmSoftwarePlatform/hipify_torch.git) registered for path 'external/hipify_torch'
2025-05-07T20:22:45.6035186Z Submodule 'external/json' (https://github.com/nlohmann/json.git) registered for path 'external/json'
2025-05-07T20:22:45.6065910Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/asmjit'...
2025-05-07T20:22:45.9861749Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/composable_kernel'...
2025-05-07T20:22:46.4722335Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cpuinfo'...
2025-05-07T20:22:46.9298870Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cutlass'...
2025-05-07T20:22:48.0445259Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/googletest'...
2025-05-07T20:22:48.3860781Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/hipify_torch'...
2025-05-07T20:22:48.6848699Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/json'...
2025-05-07T20:22:49.7675317Z From https://github.com/asmjit/asmjit
2025-05-07T20:22:49.7675786Z  * branch  e5d7c0bd5d9aec44d68830187138149e6a8c4e32 -> FETCH_HEAD
2025-05-07T20:22:49.8168458Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:22:50.4837349Z From https://github.com/jwfromm/composable_kernel
2025-05-07T20:22:50.4837908Z  * branch  4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 -> FETCH_HEAD
2025-05-07T20:22:50.7637159Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:22:51.3903930Z From https://github.com/pytorch/cpuinfo
2025-05-07T20:22:51.3904374Z  * branch  6543fec09b2f04ac4a666882998b534afc9c1349 -> FETCH_HEAD
2025-05-07T20:22:51.4901506Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:22:52.6267467Z From https://github.com/jwfromm/cutlass
2025-05-07T20:22:52.6267912Z  * branch  3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 -> FETCH_HEAD
2025-05-07T20:22:53.3201234Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:22:54.0569276Z From https://github.com/google/googletest
2025-05-07T20:22:54.0569720Z  * branch  f8d7d77c06936315286eb55f8de22cd23c188571 -> FETCH_HEAD
2025-05-07T20:22:54.0976855Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:22:54.7288302Z From https://github.com/ROCmSoftwarePlatform/hipify_torch
2025-05-07T20:22:54.7288810Z  * branch  420084499c7c1e1c2d801922f40df202eac5f3a0 -> FETCH_HEAD
2025-05-07T20:22:54.7402582Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:22:55.4854803Z From https://github.com/nlohmann/json
2025-05-07T20:22:55.4855247Z  * branch  9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 -> FETCH_HEAD
2025-05-07T20:22:55.5964914Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:22:55.5986140Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:22:55.6320636Z Entering 'external/asmjit'
2025-05-07T20:22:55.6351951Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.6382276Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.6414982Z Entering 'external/cutlass'
2025-05-07T20:22:55.6445442Z Entering 'external/googletest'
2025-05-07T20:22:55.6477800Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.6510808Z Entering 'external/json'
2025-05-07T20:22:55.6557050Z ##[endgroup]
2025-05-07T20:22:55.6557462Z ##[group]Persisting credentials for submodules
2025-05-07T20:22:55.6564046Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:22:55.6894153Z Entering 'external/asmjit'
2025-05-07T20:22:55.6958468Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.7038526Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.7107569Z Entering 'external/cutlass'
2025-05-07T20:22:55.7181271Z Entering 'external/googletest'
2025-05-07T20:22:55.7251852Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.7318097Z Entering 'external/json'
2025-05-07T20:22:55.7400531Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:22:55.7729542Z Entering 'external/asmjit'
2025-05-07T20:22:55.7794351Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:22:55.7797175Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.7858373Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:22:55.7861214Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.7922023Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:22:55.7925072Z Entering 'external/cutlass'
2025-05-07T20:22:55.7985682Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:22:55.7988475Z Entering 'external/googletest'
2025-05-07T20:22:55.8049557Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:22:55.8052573Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.8113935Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:22:55.8116558Z Entering 'external/json'
2025-05-07T20:22:55.8179438Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:22:55.8270209Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:22:55.8608492Z Entering 'external/asmjit'
2025-05-07T20:22:55.8640767Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.8672311Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.8703671Z Entering 'external/cutlass'
2025-05-07T20:22:55.8736903Z Entering 'external/googletest'
2025-05-07T20:22:55.8768140Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.8799505Z Entering 'external/json'
2025-05-07T20:22:55.8846925Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:22:55.9167511Z Entering 'external/asmjit'
2025-05-07T20:22:55.9201219Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.9233721Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.9264808Z Entering 'external/cutlass'
2025-05-07T20:22:55.9296579Z Entering 'external/googletest'
2025-05-07T20:22:55.9328319Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.9359703Z Entering 'external/json'
2025-05-07T20:22:55.9402661Z ##[endgroup]
2025-05-07T20:22:55.9444403Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:22:55.9471289Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
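The checkout step above reduces to a handful of git commands and can be approximated locally. A minimal sketch, assuming an unauthenticated clone (the runner's AUTHORIZATION extraheader and insteadOf rewrites are elided):

# Shallow-fetch the PR merge ref that the CI job tested, then init submodules.
git init FBGEMM && cd FBGEMM
git remote add origin https://github.com/pytorch/FBGEMM
git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
git checkout --force refs/remotes/pull/4066/merge   # detached HEAD, as in the log
git submodule sync
git -c protocol.version=2 submodule update --init --force --depth=1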
2025-05-07T20:22:55.9647198Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:22:55.9647609Z with:
2025-05-07T20:22:55.9647852Z   name: fbgemm_genai_x86_clang_py3.11_cu12.6.3.whl
2025-05-07T20:22:55.9648182Z   merge-multiple: false
2025-05-07T20:22:55.9648433Z   repository: pytorch/FBGEMM
2025-05-07T20:22:55.9648709Z   run-id: 14891846252
2025-05-07T20:22:55.9648947Z env:
2025-05-07T20:22:55.9649166Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:55.9649456Z   BUILD_ENV: build_binary
2025-05-07T20:22:55.9649695Z   BUILD_TARGET: genai
2025-05-07T20:22:55.9649917Z   BUILD_VARIANT: cuda
2025-05-07T20:22:55.9650152Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:55.9650393Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:55.9650638Z ##[endgroup]
2025-05-07T20:22:56.1957846Z Downloading single artifact
2025-05-07T20:22:56.3256734Z Preparing to download the following artifacts:
2025-05-07T20:22:56.3257733Z - fbgemm_genai_x86_clang_py3.11_cu12.6.3.whl (ID: 3081362348, Size: 12539580, Expected Digest: sha256:3102cc9a69eb3583ac0189afa1ca8413efb7589bfcb502323f89be9052562745)
2025-05-07T20:22:56.3770709Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-ed4592c2-86e8-5a3b-90c7-f7838a45299d/artifacts/e12be460a7cfa72843e2743f76c4cd1766662f0611b2be7066228e0cc294f44e.zip
2025-05-07T20:22:56.3772132Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:56.4451485Z (node:57018) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:22:56.4452500Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:22:56.6224279Z SHA256 digest of downloaded artifact is 3102cc9a69eb3583ac0189afa1ca8413efb7589bfcb502323f89be9052562745
2025-05-07T20:22:56.6225091Z Artifact download completed successfully.
2025-05-07T20:22:56.6225423Z Total of 1 artifact(s) downloaded
2025-05-07T20:22:56.6230156Z Download artifact has finished successfully
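The action compared the downloaded archive against the expected SHA-256 digest shown above; the same check can be reproduced by hand. A minimal sketch, assuming the archive has been saved locally as artifact.zip (a hypothetical filename):

# sha256sum --check exits non-zero if the computed digest does not match.
echo "3102cc9a69eb3583ac0189afa1ca8413efb7589bfcb502323f89be9052562745  artifact.zip" | sha256sum --check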
2025-05-07T20:22:56.6488144Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:22:56.6488536Z with:
2025-05-07T20:22:56.6488753Z   driver-version: 570.133.07
2025-05-07T20:22:56.6488997Z env:
2025-05-07T20:22:56.6489212Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.6489515Z   BUILD_ENV: build_binary
2025-05-07T20:22:56.6489765Z   BUILD_TARGET: genai
2025-05-07T20:22:56.6489992Z   BUILD_VARIANT: cuda
2025-05-07T20:22:56.6490230Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:56.6490487Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.6490726Z ##[endgroup]
2025-05-07T20:22:56.6580209Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:22:56.6580596Z with:
2025-05-07T20:22:56.6580984Z   timeout_minutes: 10
2025-05-07T20:22:56.6581216Z   max_attempts: 3
2025-05-07T20:22:56.6604261Z   command: # Is it disgusting to have a full shell script here in this github action? Sure
    # But is it the best way to make it so that this action relies on nothing else? Absolutely
    set -eou pipefail

    DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
    DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

    install_nvidia_docker2_amzn2() {
      (
        set -x
        # Needed for yum-config-manager
        sudo yum install -y yum-utils
        if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
          YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
        else
          # Amazon Linux 2
          YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
        fi
        sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
        sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      )
    }

    install_nvidia_docker2_ubuntu20() {
      (
        set -x
        # Install nvidia-driver package if not installed
        status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
        if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
          sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
          sudo systemctl restart docker
        fi
      )
    }

    pre_install_nvidia_driver_amzn2() {
      (
        # Purge any nvidia driver installed from RHEL repo
        sudo yum remove -y nvidia-driver-latest-dkms
      )
    }

    install_nvidia_driver_common() {
      (
        # Try to gather more information about the runner and its existing NVIDIA driver if any
        echo "Before installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true

        HAS_NVIDIA_DRIVER=0
        # Check if NVIDIA driver has already been installed
        if [ -x "$(command -v nvidia-smi)" ]; then
          set +e
          # The driver exists; check its version next. Also check only the first GPU if there is more than one,
          # so that the same driver version is not printed over multiple lines
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
          elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
            # Turn off persistent mode so that the installation script can unload the kernel module
            sudo killall nvidia-persistenced || true
          else
            HAS_NVIDIA_DRIVER=1
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
          fi
          set -e
        fi

        if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
          # CAUTION: this may need to be updated in future
          if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
            sudo yum groupinstall -y "Development Tools"
            # ensure our kernel install is the same as our underlying kernel,
            # groupinstall "Development Tools" has a habit of mismatching kernel headers
            sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
            sudo modprobe backlight
          fi
          sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

          set +e
          sudo /bin/bash /tmp/nvidia_driver -s --no-drm
          NVIDIA_INSTALLATION_STATUS=$?

          RESET_GPU=0
          if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
            sudo cat /var/log/nvidia-installer.log
            # Failed to install the NVIDIA driver; try to reset the GPU
            RESET_GPU=1
          elif [ -x "$(command -v nvidia-smi)" ]; then
            # Check again if nvidia-smi works even if the driver installation completes successfully
            INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
            NVIDIA_SMI_STATUS=$?
            if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
              RESET_GPU=1
            fi
          fi

          if [ "$RESET_GPU" -eq 1 ]; then
            NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
            # The GPU can get stuck in a failure state if somehow the test crashes the GPU microcode. When this
            # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
            for PCI_ID in $NVIDIA_DEVICES; do
              DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
              echo "Resetting $PCI_ID (enabled state: $DEVICE_ENABLED)"
              # This requires sudo permission of course
              echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
              sleep 1
            done
          fi

          sudo rm -fv /tmp/nvidia_driver
          set -e
        fi
      )
    }

    post_install_nvidia_driver_common() {
      (
        sudo modprobe nvidia || true
        echo "After installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true
        (
          set +e
          nvidia-smi
          # NB: Annoyingly, the nvidia-smi command returns successfully with return code 0 even in
          # the case where the driver has already crashed, as it still can get the driver version
          # and some basic information like the bus ID. However, the rest of the information
          # would be missing (ERR!), for example:
          #
          # +-----------------------------------------------------------------------------+
          # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
          # |-------------------------------+----------------------+----------------------+
          # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
          # |                               |                      |               MIG M. |
          # |===============================+======================+======================|
          # |   0  ERR!                Off  | 00000000:00:1E.0 Off |                 ERR! |
          # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |    ERR!      Default |
          # |                               |                      |                 ERR! |
          # +-------------------------------+----------------------+----------------------+
          #
          # +-----------------------------------------------------------------------------+
          # | Processes:                                                                  |
          # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
          # |        ID   ID                                                   Usage      |
          # |=============================================================================|
          # +-----------------------------------------------------------------------------+
          #
          # This should be reported as a failure instead, as it is guaranteed to fail when
          # Docker tries to run with --gpus all
          #
          # So, the correct check here is to query one of the missing pieces of info, like the
          # GPU name, so that the command can fail accordingly
          nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
          NVIDIA_SMI_STATUS=$?

          # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
          if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
            echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
          else
            echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
            exit ${NVIDIA_SMI_STATUS}
          fi
          set -e
        )
      )
    }

    install_nvidia_driver_amzn2() {
      (
        set -x
        pre_install_nvidia_driver_amzn2
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    install_nvidia_driver_ubuntu20() {
      (
        set -x
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    echo "== Installing nvidia driver ${DRIVER_FN} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_driver_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_driver_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    # Install container toolkit based on distribution
    echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_docker2_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_docker2_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

    # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
    # more than one GPU. This just needs to be run once. The command fails
    # on subsequent runs and complains that the mode is already on, but that's
    # ok
    sudo nvidia-persistenced || true

    # This should show persistence mode ON
    nvidia-smi
2025-05-07T20:22:56.6627732Z   retry_wait_seconds: 10
2025-05-07T20:22:56.6627993Z   polling_interval_seconds: 1
2025-05-07T20:22:56.6628255Z   warning_on_retry: true
2025-05-07T20:22:56.6628508Z   continue_on_error: false
2025-05-07T20:22:56.6628746Z env:
2025-05-07T20:22:56.6628972Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.6629272Z   BUILD_ENV: build_binary
2025-05-07T20:22:56.6629515Z   BUILD_TARGET: genai
2025-05-07T20:22:56.6629743Z   BUILD_VARIANT: cuda
2025-05-07T20:22:56.6629987Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:56.6630245Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.6630483Z   DRIVER_VERSION: 570.133.07
2025-05-07T20:22:56.6630732Z ##[endgroup]
2025-05-07T20:22:56.7428189Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:22:56.7428805Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:22:56.7432090Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:22:57.2929574Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:22:57.2943763Z No packages marked for removal.
2025-05-07T20:22:57.2993803Z Dependencies resolved.
2025-05-07T20:22:57.3003471Z Nothing to do.
2025-05-07T20:22:57.3003750Z Complete!
2025-05-07T20:22:57.3330775Z + install_nvidia_driver_common
2025-05-07T20:22:57.3334427Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:22:57.3334738Z + lspci
2025-05-07T20:22:57.3336117Z Before installing NVIDIA driver
2025-05-07T20:22:57.3522977Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:57.3523770Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:57.3524317Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:57.3524819Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:57.3525288Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:57.3525804Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:57.3526280Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:57.3526740Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:57.3527133Z + lsmod
2025-05-07T20:22:57.3569382Z Module                  Size  Used by
2025-05-07T20:22:57.3569966Z xt_conntrack           16384  1
2025-05-07T20:22:57.3570477Z nft_chain_nat          16384  3
2025-05-07T20:22:57.3570998Z xt_MASQUERADE          20480  1
2025-05-07T20:22:57.3571601Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:57.3572248Z nf_conntrack_netlink   57344  0
2025-05-07T20:22:57.3573025Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:57.3573879Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:22:57.3574488Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:22:57.3575061Z xfrm_user              57344  1
2025-05-07T20:22:57.3575583Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:22:57.3576144Z xt_addrtype            16384  2
2025-05-07T20:22:57.3576655Z nft_compat             20480  4
2025-05-07T20:22:57.3577262Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:22:57.3578086Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:57.3578820Z br_netfilter           36864  0
2025-05-07T20:22:57.3579125Z bridge                323584  1 br_netfilter
2025-05-07T20:22:57.3579431Z stp                    16384  1 bridge
2025-05-07T20:22:57.3579708Z llc                    16384  2 bridge,stp
2025-05-07T20:22:57.3579993Z overlay               167936  0
2025-05-07T20:22:57.3580248Z tls                   135168  0
2025-05-07T20:22:57.3580499Z nls_ascii              16384  1
2025-05-07T20:22:57.3580749Z nls_cp437              20480  1
2025-05-07T20:22:57.3581001Z vfat                   24576  1
2025-05-07T20:22:57.3581254Z fat                    86016  1 vfat
2025-05-07T20:22:57.3581517Z ena                   180224  0
2025-05-07T20:22:57.3581759Z i8042                  45056  0
2025-05-07T20:22:57.3582015Z serio                  28672  3 i8042
2025-05-07T20:22:57.3582287Z ghash_clmulni_intel    16384  0
2025-05-07T20:22:57.3582548Z button                 24576  0
2025-05-07T20:22:57.3582802Z sunrpc                696320  1
2025-05-07T20:22:57.3583053Z sch_fq_codel           20480  17
2025-05-07T20:22:57.3583316Z dm_mod                188416  0
2025-05-07T20:22:57.3583570Z fuse                  163840  1
2025-05-07T20:22:57.3583822Z configfs               57344  1
2025-05-07T20:22:57.3584076Z loop                   36864  0
2025-05-07T20:22:57.3584333Z dax                    45056  1 dm_mod
2025-05-07T20:22:57.3584604Z dmi_sysfs              20480  0
2025-05-07T20:22:57.3584863Z crc32_pclmul           16384  0
2025-05-07T20:22:57.3585121Z crc32c_intel           24576  0
2025-05-07T20:22:57.3585380Z efivarfs               24576  1
2025-05-07T20:22:57.3585627Z + modinfo nvidia
2025-05-07T20:22:57.3591088Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:57.3591562Z import_ns:      DMA_BUF
2025-05-07T20:22:57.3591813Z alias:          char-major-195-*
2025-05-07T20:22:57.3592084Z version:        570.133.07
2025-05-07T20:22:57.3592341Z supported:      external
2025-05-07T20:22:57.3592674Z license:        Dual MIT/GPL
2025-05-07T20:22:57.3592976Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:57.3593320Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:57.3593931Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:22:57.3594266Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:57.3594608Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:57.3594949Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:57.3595265Z depends:        i2c-core,drm
2025-05-07T20:22:57.3595521Z retpoline:      Y
2025-05-07T20:22:57.3595749Z name:           nvidia
2025-05-07T20:22:57.3596118Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:57.3596624Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:57.3597065Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:57.3597582Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:22:57.3597900Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:22:57.3598312Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:57.3598766Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:22:57.3599141Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:22:57.3599443Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:22:57.3599809Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:57.3600202Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:22:57.3600544Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:22:57.3600852Z parm:           NVreg_EnableMSI:int
2025-05-07T20:22:57.3601176Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:57.3601686Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:57.3602095Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:57.3602475Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:57.3602895Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.3603297Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:22:57.3603722Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.3604132Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:22:57.3604473Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:57.3604840Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:57.3605219Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:57.3605561Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:22:57.3606079Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:57.3606417Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:57.3606745Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:57.3607053Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:22:57.3607406Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:57.3607839Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:22:57.3608179Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:22:57.3608515Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:57.3608872Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:57.3609217Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:22:57.3609556Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:57.3609901Z parm:           NVreg_RmMsg:charp
2025-05-07T20:22:57.3610198Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:22:57.3610521Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:22:57.3610860Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:22:57.3611185Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:57.3611517Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:57.3611882Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:57.3612242Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:22:57.3612580Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:22:57.3612924Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:57.3613272Z parm:           rm_firmware_active:charp
2025-05-07T20:22:57.3613740Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:22:57.3613994Z ++ command -v nvidia-smi
2025-05-07T20:22:57.3614252Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:22:57.3614510Z + set +e
2025-05-07T20:22:57.3614815Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:22:59.1803623Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:22:59.1804003Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:22:59.1804245Z + '[' 0 -ne 0 ']'
2025-05-07T20:22:59.1804469Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:22:59.1804745Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:22:59.1805173Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:22:59.1805875Z + set -e
2025-05-07T20:22:59.1806453Z + '[' 1 -eq 0 ']'
2025-05-07T20:22:59.1806841Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:22:59.1807293Z + post_install_nvidia_driver_common
2025-05-07T20:22:59.1810932Z + sudo modprobe nvidia
2025-05-07T20:22:59.2711710Z + echo 'After installing NVIDIA driver'
2025-05-07T20:22:59.2712043Z + lspci
2025-05-07T20:22:59.2712268Z After installing NVIDIA driver
2025-05-07T20:22:59.2827332Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:59.2827821Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:59.2828359Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:59.2828876Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:59.2829351Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:59.2829868Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:59.2830379Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:59.2830856Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:59.2831265Z + lsmod
2025-05-07T20:22:59.2858878Z Module                  Size  Used by
2025-05-07T20:22:59.2859243Z nvidia_uvm           1884160  0
2025-05-07T20:22:59.2859619Z nvidia              11583488  1 nvidia_uvm
2025-05-07T20:22:59.2860015Z drm                   602112  1 nvidia
2025-05-07T20:22:59.2860384Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:22:59.2860695Z backlight              24576  1 drm
2025-05-07T20:22:59.2860983Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:22:59.2861297Z xt_conntrack           16384  1
2025-05-07T20:22:59.2861669Z nft_chain_nat          16384  3
2025-05-07T20:22:59.2862022Z xt_MASQUERADE          20480  1
2025-05-07T20:22:59.2862423Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:59.2862778Z nf_conntrack_netlink   57344  0
2025-05-07T20:22:59.2863177Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:59.2863606Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:22:59.2863926Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:22:59.2864220Z xfrm_user              57344  1
2025-05-07T20:22:59.2864483Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:22:59.2864761Z xt_addrtype            16384  2
2025-05-07T20:22:59.2865019Z nft_compat             20480  4
2025-05-07T20:22:59.2865328Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:22:59.2865737Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:59.2866100Z br_netfilter           36864  0
2025-05-07T20:22:59.2866369Z bridge                323584  1 br_netfilter
2025-05-07T20:22:59.2866678Z stp                    16384  1 bridge
2025-05-07T20:22:59.2866951Z llc                    16384  2 bridge,stp
2025-05-07T20:22:59.2867237Z overlay               167936  0
2025-05-07T20:22:59.2867485Z tls                   135168  0
2025-05-07T20:22:59.2867728Z nls_ascii              16384  1
2025-05-07T20:22:59.2868236Z nls_cp437              20480  1
2025-05-07T20:22:59.2868485Z vfat                   24576  1
2025-05-07T20:22:59.2868733Z fat                    86016  1 vfat
2025-05-07T20:22:59.2868988Z ena                   180224  0
2025-05-07T20:22:59.2869231Z i8042                  45056  0
2025-05-07T20:22:59.2869477Z serio                  28672  3 i8042
2025-05-07T20:22:59.2869746Z ghash_clmulni_intel    16384  0
2025-05-07T20:22:59.2870003Z button                 24576  0
2025-05-07T20:22:59.2870253Z sunrpc                696320  1
2025-05-07T20:22:59.2870500Z sch_fq_codel           20480  17
2025-05-07T20:22:59.2870757Z dm_mod                188416  0
2025-05-07T20:22:59.2870999Z fuse                  163840  1
2025-05-07T20:22:59.2871237Z configfs               57344  1
2025-05-07T20:22:59.2871623Z loop                   36864  0
2025-05-07T20:22:59.2871872Z dax                    45056  1 dm_mod
2025-05-07T20:22:59.2872144Z dmi_sysfs              20480  0
2025-05-07T20:22:59.2872388Z crc32_pclmul           16384  0
2025-05-07T20:22:59.2872644Z crc32c_intel           24576  0
2025-05-07T20:22:59.2872891Z efivarfs               24576  1
2025-05-07T20:22:59.2873130Z + modinfo nvidia
2025-05-07T20:22:59.2876062Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:59.2876721Z import_ns:      DMA_BUF
2025-05-07T20:22:59.2877055Z alias:          char-major-195-*
2025-05-07T20:22:59.2877392Z version:        570.133.07
2025-05-07T20:22:59.2877641Z supported:      external
2025-05-07T20:22:59.2877886Z license:        Dual MIT/GPL
2025-05-07T20:22:59.2878159Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:59.2878494Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:59.2878806Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:22:59.2879119Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:59.2879455Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:59.2879783Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:59.2880090Z depends:        i2c-core,drm
2025-05-07T20:22:59.2880344Z retpoline:      Y
2025-05-07T20:22:59.2880557Z name:           nvidia
2025-05-07T20:22:59.2880953Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:59.2881600Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:59.2882199Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:59.2882678Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:22:59.2882981Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:22:59.2883280Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:59.2883595Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:22:59.2883893Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:22:59.2884208Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:22:59.2884572Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:59.2884973Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:22:59.2885305Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:22:59.2885607Z parm:           NVreg_EnableMSI:int
2025-05-07T20:22:59.2885913Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:59.2886270Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:59.2886668Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:59.2887050Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:59.2887541Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:59.2887960Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:22:59.2888382Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:59.2888811Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:22:59.2889145Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:59.2889511Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:59.2890004Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:59.2890345Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:22:59.2890666Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:59.2890997Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:59.2891312Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:59.2891623Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:22:59.2891967Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:59.2892326Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:22:59.2892646Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:22:59.2892976Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:59.2893316Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:59.2893743Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:22:59.2894081Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:59.2894408Z parm:           NVreg_RmMsg:charp
2025-05-07T20:22:59.2894694Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:22:59.2895026Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:22:59.2895350Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:22:59.2895655Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:59.2895985Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:59.2896337Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:59.2896732Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:22:59.2897061Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:22:59.2897398Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:59.2897734Z parm:           rm_firmware_active:charp
2025-05-07T20:22:59.2898019Z + set +e
2025-05-07T20:22:59.2898219Z + nvidia-smi
2025-05-07T20:23:00.6992256Z Wed May  7 20:23:00 2025
2025-05-07T20:23:00.6992781Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:00.6993408Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:00.6993898Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:00.6994394Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:00.6994917Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:00.6995335Z |                                         |                        |               MIG M. |
2025-05-07T20:23:00.6995665Z |=========================================+========================+======================|
2025-05-07T20:23:00.7056239Z |   0  NVIDIA A10G                   Off  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:00.7056872Z |  0%   29C    P0             61W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:00.7057406Z |                                         |                        |                  N/A |
2025-05-07T20:23:00.7057953Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:00.7058479Z
2025-05-07T20:23:00.7058870Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:00.7059292Z | Processes:                                                                              |
2025-05-07T20:23:00.7059727Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:00.7060133Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:00.7060473Z |=========================================================================================|
2025-05-07T20:23:00.7061361Z |  No running processes found                                                             |
2025-05-07T20:23:00.7062452Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.1168013Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:02.5204024Z NVIDIA A10G
2025-05-07T20:23:02.8032403Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:02.8032664Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:02.8032907Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:02.8033190Z + set -e
2025-05-07T20:23:02.8033390Z INFO: Ignoring allowed status 0
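The driver check that just ran treats nvidia-smi exit statuses 0 and 14 as healthy and anything else as a failure. The same logic, factored out of the script above into a standalone helper (the function name is mine):

# Query a field that goes missing (ERR!) when the driver has crashed, so a
# wedged GPU fails the check instead of sneaking through with status 0.
check_gpu_health() {
  nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
  local status=$?
  # Statuses 0 and 14 are allowed, per https://github.com/NVIDIA/gpu-operator/issues/285
  if [ "$status" -eq 0 ] || [ "$status" -eq 14 ]; then
    echo "INFO: Ignoring allowed status ${status}"
  else
    echo "ERROR: nvidia-smi exited with unresolved status ${status}"
    return "$status"
  fi
}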
2025-05-07T20:23:02.8041436Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:02.8045411Z + sudo yum install -y yum-utils
2025-05-07T20:23:03.2040635Z Last metadata expiration check: 0:05:44 ago on Wed May  7 20:17:19 2025.
2025-05-07T20:23:03.2288389Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:03.2680078Z Dependencies resolved.
2025-05-07T20:23:03.2859750Z Nothing to do.
2025-05-07T20:23:03.2860441Z Complete!
2025-05-07T20:23:03.3253570Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:03.3254392Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.3255315Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.6104974Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.6668170Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:04.2313304Z nvidia-container-toolkit                         14 kB/s | 833  B     00:00
2025-05-07T20:23:04.2559093Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:04.2955567Z Dependencies resolved.
2025-05-07T20:23:04.3137286Z ================================================================================
2025-05-07T20:23:04.3137711Z  Package                        Arch    Version   Repository                Size
2025-05-07T20:23:04.3138193Z ================================================================================
2025-05-07T20:23:04.3138585Z Downgrading:
2025-05-07T20:23:04.3138961Z  nvidia-container-toolkit       x86_64  1.16.2-1  nvidia-container-toolkit 1.2 M
2025-05-07T20:23:04.3139554Z  nvidia-container-toolkit-base  x86_64  1.16.2-1  nvidia-container-toolkit 5.6 M
2025-05-07T20:23:04.3139907Z
2025-05-07T20:23:04.3140008Z Transaction Summary
2025-05-07T20:23:04.3140269Z ================================================================================
2025-05-07T20:23:04.3140584Z Downgrade  2 Packages
2025-05-07T20:23:04.3140741Z
2025-05-07T20:23:04.3140915Z Total download size: 6.8 M
2025-05-07T20:23:04.3142339Z Downloading Packages:
2025-05-07T20:23:04.3571701Z (1/2): nvidia-container-toolkit-1.16.2-1.x86_64  30 MB/s | 1.2 MB     00:00
2025-05-07T20:23:04.4299744Z (2/2): nvidia-container-toolkit-base-1.16.2-1.x  49 MB/s | 5.6 MB     00:00
2025-05-07T20:23:04.4308534Z --------------------------------------------------------------------------------
2025-05-07T20:23:04.4311463Z Total                                            59 MB/s | 6.8 MB     00:00
2025-05-07T20:23:04.4314000Z Running transaction check
2025-05-07T20:23:04.4417048Z Transaction check succeeded.
2025-05-07T20:23:04.4417351Z Running transaction test
2025-05-07T20:23:04.4711084Z Transaction test succeeded.
2025-05-07T20:23:04.4713648Z Running transaction
2025-05-07T20:23:05.0235639Z   Preparing        :                                                        1/1
2025-05-07T20:23:05.1329560Z   Downgrading      : nvidia-container-toolkit-base-1.16.2-1.x86_64          1/4
2025-05-07T20:23:05.1358556Z   Downgrading      : nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:05.1576006Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:05.1576788Z   Cleanup          : nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:05.1684356Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:05.1709084Z   Cleanup          : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
2025-05-07T20:23:06.5438072Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               4/4
2025-05-07T20:23:06.5438865Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64               1/4
2025-05-07T20:23:06.5439449Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64               2/4
2025-05-07T20:23:06.5439974Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64          3/4
2025-05-07T20:23:06.6727410Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
================================================================================
2025-05-07T20:23:06.6728558Z WARNING:
2025-05-07T20:23:06.6728836Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:06.6729063Z
2025-05-07T20:23:06.6729156Z   Available Versions:
2025-05-07T20:23:06.6729308Z
2025-05-07T20:23:06.6729411Z   Version 2023.7.20250331:
2025-05-07T20:23:06.6729728Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:06.6729981Z
2025-05-07T20:23:06.6730108Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:06.6730315Z
2025-05-07T20:23:06.6730402Z     Release notes:
2025-05-07T20:23:06.6730815Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:06.6731181Z
2025-05-07T20:23:06.6731279Z   Version 2023.7.20250414:
2025-05-07T20:23:06.6731588Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:06.6731843Z
2025-05-07T20:23:06.6731957Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:06.6732171Z
2025-05-07T20:23:06.6732257Z     Release notes:
2025-05-07T20:23:06.6732797Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:06.6733186Z
2025-05-07T20:23:06.6733307Z   Version 2023.7.20250428:
2025-05-07T20:23:06.6733918Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:06.6734201Z
2025-05-07T20:23:06.6734379Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:06.6734614Z
2025-05-07T20:23:06.6734729Z     Release notes:
2025-05-07T20:23:06.6735280Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:06.6748037Z
2025-05-07T20:23:06.6748173Z ================================================================================
2025-05-07T20:23:06.7087686Z
2025-05-07T20:23:06.7087954Z
2025-05-07T20:23:06.7088099Z Downgraded:
2025-05-07T20:23:06.7088592Z   nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:06.7089242Z   nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:06.7089596Z
2025-05-07T20:23:06.7089684Z Complete!
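The transaction above deliberately downgraded the toolkit from 1.17.6 to the pinned 1.16.2. One way to keep a later dnf upgrade from undoing the pin is the versionlock plugin; a sketch, assuming the plugin package is available on the AMI (it is not part of this job's script):

sudo dnf install -y 'dnf-command(versionlock)'   # pulls in the versionlock plugin
sudo dnf versionlock add nvidia-container-toolkit-1.16.2-1 nvidia-container-toolkit-base-1.16.2-1
sudo dnf versionlock list                        # confirm both locks took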
2025-05-07T20:23:06.7535303Z + sudo systemctl restart docker
2025-05-07T20:23:10.6649006Z Wed May  7 20:23:10 2025
2025-05-07T20:23:10.6649607Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:10.6650237Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:10.6650725Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:10.6651218Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:10.6651749Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:10.6652170Z |                                         |                        |               MIG M. |
2025-05-07T20:23:10.6652504Z |=========================================+========================+======================|
2025-05-07T20:23:10.6732043Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:10.6733074Z |  0%   29C    P0             61W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:10.6733592Z |                                         |                        |                  N/A |
2025-05-07T20:23:10.6734138Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:10.6734653Z
2025-05-07T20:23:10.6735152Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:10.6735593Z | Processes:                                                                              |
2025-05-07T20:23:10.6736037Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:10.6736678Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:10.6737154Z |=========================================================================================|
2025-05-07T20:23:10.6737728Z |  No running processes found                                                             |
2025-05-07T20:23:10.6738269Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:11.7199626Z Command completed after 1 attempt(s).
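The step's lasting effect is the GPU_FLAG value appended to GITHUB_ENV; a later step would typically splice it into a docker run line. A sketch of that usage, with a hypothetical CUDA image tag:

# GPU_FLAG expands to: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
# Intentionally left unquoted so the two flags split into separate arguments.
docker run --rm ${GPU_FLAG} nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi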
2025-05-07T20:23:11.7286107Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:11.7286585Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:11.7301930Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:11.7302278Z env:
2025-05-07T20:23:11.7302497Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:11.7302817Z   BUILD_ENV: build_binary
2025-05-07T20:23:11.7303071Z   BUILD_TARGET: genai
2025-05-07T20:23:11.7303313Z   BUILD_VARIANT: cuda
2025-05-07T20:23:11.7303557Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:11.7303825Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:11.7304131Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:11.7304466Z ##[endgroup]
2025-05-07T20:23:12.0650980Z ################################################################################
2025-05-07T20:23:12.0651360Z # Print System Info
2025-05-07T20:23:12.0651582Z #
2025-05-07T20:23:12.0665955Z # [2025-05-07T20:23:12.066Z] + print_system_info
2025-05-07T20:23:12.0666327Z ################################################################################
2025-05-07T20:23:12.0666542Z
2025-05-07T20:23:12.0666655Z ################################################################################
2025-05-07T20:23:12.0666990Z [INFO] Printing environment variables ...
2025-05-07T20:23:12.0667288Z + printenv
2025-05-07T20:23:12.0667402Z
2025-05-07T20:23:12.0688020Z SHELL=/bin/bash
2025-05-07T20:23:12.0688391Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:12.0688799Z BUILD_VARIANT=cuda
2025-05-07T20:23:12.0689325Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_2a59ce91-961f-4faf-8430-0285c6cfa352
2025-05-07T20:23:12.0689887Z GITHUB_ACTION=__run
2025-05-07T20:23:12.0690177Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:12.0690549Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:12.0690806Z RUNNER_NAME=i-0e49e9d70b38203df
2025-05-07T20:23:12.0691082Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:12.0691386Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:12.0691654Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:12.0692016Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:12.0692440Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:12.0692722Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:12.0693019Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:12.0693588Z ***
2025-05-07T20:23:12.0693793Z LOGNAME=ec2-user
2025-05-07T20:23:12.0694029Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:12.0694283Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:12.0694514Z GITHUB_ACTIONS=true
2025-05-07T20:23:12.0694739Z SYSTEMD_EXEC_PID=55576
2025-05-07T20:23:12.0695009Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:12.0695557Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:12.0696068Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:12.0696350Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:12.0696606Z RUNNER_OS=Linux
2025-05-07T20:23:12.0696836Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:12.0697084Z HOME=/home/ec2-user
2025-05-07T20:23:12.0697324Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:12.0697617Z LANG=C.UTF-8
2025-05-07T20:23:12.0697907Z RUNNER_TRACKING_ID=github_d031b214-61cb-47a7-8b20-1c60992044bb
2025-05-07T20:23:12.0698253Z RUNNER_ARCH=X64
2025-05-07T20:23:12.0698528Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:12.0699210Z BUILD_TARGET=genai
2025-05-07T20:23:12.0699732Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_2a59ce91-961f-4faf-8430-0285c6cfa352
2025-05-07T20:23:12.0700588Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_2a59ce91-961f-4faf-8430-0285c6cfa352
2025-05-07T20:23:12.0701314Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:12.0701965Z INVOCATION_ID=452351df4c4043e9bfbc1dc973e48402
2025-05-07T20:23:12.0702288Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:12.0702553Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:12.0703125Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_2a59ce91-961f-4faf-8430-0285c6cfa352
2025-05-07T20:23:12.0703724Z BUILD_ENV=build_binary
2025-05-07T20:23:12.0703955Z GITHUB_ACTOR=q10
2025-05-07T20:23:12.0704170Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:12.0704384Z KERN_NAME_LC=linux
2025-05-07T20:23:12.0704617Z BUILD_CUDA_VERSION=12.6.3
2025-05-07T20:23:12.0704912Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:12.0705260Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:12.0705498Z USER=ec2-user
2025-05-07T20:23:12.0706029Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:12.0706324Z SHLVL=1
2025-05-07T20:23:12.0706510Z GITHUB_ACTOR_ID=255046
2025-05-07T20:23:12.0706819Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool
2025-05-07T20:23:12.0707258Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e
2025-05-07T20:23:12.0707622Z GITHUB_REF_NAME=4066/merge
2025-05-07T20:23:12.0707855Z KERN_NAME=Linux
2025-05-07T20:23:12.0708083Z GITHUB_JOB=test_and_publish_artifact
2025-05-07T20:23:12.0708487Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh
2025-05-07T20:23:12.0708904Z GITHUB_REPOSITORY=pytorch/FBGEMM
2025-05-07T20:23:12.0709179Z GITHUB_RETENTION_DAYS=90
2025-05-07T20:23:12.0709416Z JOURNAL_STREAM=8:83634
2025-05-07T20:23:12.0709727Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM
2025-05-07T20:23:12.0710087Z GITHUB_ACTION_REPOSITORY=
2025-05-07T20:23:12.0710391Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
2025-05-07T20:23:12.0710709Z GITHUB_BASE_REF=main
2025-05-07T20:23:12.0710927Z CI=true
2025-05-07T20:23:12.0711137Z GITHUB_REPOSITORY_OWNER=pytorch
2025-05-07T20:23:12.0711411Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6
2025-05-07T20:23:12.0711676Z GITHUB_ACTION_REF=
2025-05-07T20:23:12.0711925Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI
2025-05-07T20:23:12.0712528Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_2a59ce91-961f-4faf-8430-0285c6cfa352
2025-05-07T20:23:12.0713101Z MACHINE_NAME=x86_64
2025-05-07T20:23:12.0713321Z _=/usr/bin/printenv
2025-05-07T20:23:12.0713452Z
2025-05-07T20:23:12.0713571Z ################################################################################
2025-05-07T20:23:12.0713880Z [INFO] Print ldd version ...
2025-05-07T20:23:12.0714147Z + ldd --version
2025-05-07T20:23:12.0714278Z
2025-05-07T20:23:12.0714362Z ldd (GNU libc) 2.34
2025-05-07T20:23:12.0714627Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:23:12.0715059Z This is free software; see the source for copying conditions.  There is NO
2025-05-07T20:23:12.0715586Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:23:12.0716034Z Written by Roland McGrath and Ulrich Drepper.
2025-05-07T20:23:12.0716252Z
2025-05-07T20:23:12.0716368Z ################################################################################
2025-05-07T20:23:12.0716674Z [INFO] Print CPU info ...
2025-05-07T20:23:12.0716906Z + nproc 2025-05-07T20:23:12.0717014Z 2025-05-07T20:23:12.0732447Z 16 2025-05-07T20:23:12.0734185Z 2025-05-07T20:23:12.0734427Z + lscpu 2025-05-07T20:23:12.0734543Z 2025-05-07T20:23:12.0863268Z Architecture: x86_64 2025-05-07T20:23:12.0863735Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:12.0864396Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.0864782Z Byte Order: Little Endian 2025-05-07T20:23:12.0865100Z CPU(s): 16 2025-05-07T20:23:12.0865398Z On-line CPU(s) list: 0-15 2025-05-07T20:23:12.0865716Z Vendor ID: AuthenticAMD 2025-05-07T20:23:12.0866064Z Model name: AMD EPYC 7R32 2025-05-07T20:23:12.0866386Z CPU family: 23 2025-05-07T20:23:12.0866810Z Model: 49 2025-05-07T20:23:12.0867094Z Thread(s) per core: 2 2025-05-07T20:23:12.0867387Z Core(s) per socket: 8 2025-05-07T20:23:12.0867676Z Socket(s): 1 2025-05-07T20:23:12.0867953Z Stepping: 0 2025-05-07T20:23:12.0868262Z BogoMIPS: 5600.00 2025-05-07T20:23:12.0870345Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.0872419Z Hypervisor vendor: KVM 2025-05-07T20:23:12.0872744Z Virtualization type: full 2025-05-07T20:23:12.0873087Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:12.0873471Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:12.0873834Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:12.0874200Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:12.0874527Z NUMA node(s): 1 2025-05-07T20:23:12.0874822Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:12.0875169Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:12.0875533Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:12.0875897Z Vulnerability L1tf: Not affected 2025-05-07T20:23:12.0876245Z Vulnerability Mds: Not affected 2025-05-07T20:23:12.0876618Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:12.0876974Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:12.0877424Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:12.0878084Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:12.0878869Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:12.0879618Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:12.0880537Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:12.0881444Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:12.0882238Z Vulnerability Srbds: Not affected 2025-05-07T20:23:12.0882603Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:12.0882920Z 2025-05-07T20:23:12.0883023Z + cat /proc/cpuinfo 2025-05-07T20:23:12.0883162Z 2025-05-07T20:23:12.0883266Z processor : 0 2025-05-07T20:23:12.0883490Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.0883740Z cpu family : 23 2025-05-07T20:23:12.0883959Z model : 49 
2025-05-07T20:23:12.0884173Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.0884432Z stepping : 0 2025-05-07T20:23:12.0884657Z microcode : 0x830107f 2025-05-07T20:23:12.0884998Z cpu MHz : 2442.812 2025-05-07T20:23:12.0885235Z cache size : 512 KB 2025-05-07T20:23:12.0885460Z physical id : 0 2025-05-07T20:23:12.0885675Z siblings : 16 2025-05-07T20:23:12.0885890Z core id : 0 2025-05-07T20:23:12.0886102Z cpu cores : 8 2025-05-07T20:23:12.0886307Z apicid : 0 2025-05-07T20:23:12.0886517Z initial apicid : 0 2025-05-07T20:23:12.0886739Z fpu : yes 2025-05-07T20:23:12.0886943Z fpu_exception : yes 2025-05-07T20:23:12.0887175Z cpuid level : 13 2025-05-07T20:23:12.0887394Z wp : yes 2025-05-07T20:23:12.0889567Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.0891797Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.0892293Z bogomips : 5600.00 2025-05-07T20:23:12.0892514Z TLB size : 3072 4K pages 2025-05-07T20:23:12.0892759Z clflush size : 64 2025-05-07T20:23:12.0892970Z cache_alignment : 64 2025-05-07T20:23:12.0893239Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.0893562Z power management: 2025-05-07T20:23:12.0893691Z 2025-05-07T20:23:12.0893773Z processor : 1 2025-05-07T20:23:12.0893993Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.0894227Z cpu family : 23 2025-05-07T20:23:12.0894430Z model : 49 2025-05-07T20:23:12.0894634Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.0894885Z stepping : 0 2025-05-07T20:23:12.0895084Z microcode : 0x830107f 2025-05-07T20:23:12.0895319Z cpu MHz : 3245.085 2025-05-07T20:23:12.0895528Z cache size : 512 KB 2025-05-07T20:23:12.0895743Z physical id : 0 2025-05-07T20:23:12.0895953Z siblings : 16 2025-05-07T20:23:12.0896160Z core id : 1 2025-05-07T20:23:12.0896349Z cpu cores : 8 2025-05-07T20:23:12.0896551Z apicid : 2 2025-05-07T20:23:12.0896744Z initial apicid : 2 2025-05-07T20:23:12.0896956Z fpu : yes 2025-05-07T20:23:12.0897158Z fpu_exception : yes 2025-05-07T20:23:12.0897380Z cpuid level : 13 2025-05-07T20:23:12.0897580Z wp : yes 2025-05-07T20:23:12.0899533Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.0901750Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.0902246Z bogomips : 5600.00 2025-05-07T20:23:12.0902465Z TLB size : 3072 4K pages 2025-05-07T20:23:12.0902697Z clflush size : 64 
2025-05-07T20:23:12.0902920Z cache_alignment : 64 2025-05-07T20:23:12.0903193Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.0903503Z power management: 2025-05-07T20:23:12.0903641Z 2025-05-07T20:23:12.0903727Z processor : 2 2025-05-07T20:23:12.0903948Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.0904178Z cpu family : 23 2025-05-07T20:23:12.0904401Z model : 49 2025-05-07T20:23:12.0904608Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.0904855Z stepping : 0 2025-05-07T20:23:12.0905067Z microcode : 0x830107f 2025-05-07T20:23:12.0905298Z cpu MHz : 3305.794 2025-05-07T20:23:12.0905504Z cache size : 512 KB 2025-05-07T20:23:12.0905997Z physical id : 0 2025-05-07T20:23:12.0906208Z siblings : 16 2025-05-07T20:23:12.0906566Z core id : 2 2025-05-07T20:23:12.0906763Z cpu cores : 8 2025-05-07T20:23:12.0906970Z apicid : 4 2025-05-07T20:23:12.0907174Z initial apicid : 4 2025-05-07T20:23:12.0907385Z fpu : yes 2025-05-07T20:23:12.0907590Z fpu_exception : yes 2025-05-07T20:23:12.0907814Z cpuid level : 13 2025-05-07T20:23:12.0908017Z wp : yes 2025-05-07T20:23:12.0910103Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.0912334Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.0912836Z bogomips : 5600.00 2025-05-07T20:23:12.0913053Z TLB size : 3072 4K pages 2025-05-07T20:23:12.0913294Z clflush size : 64 2025-05-07T20:23:12.0913517Z cache_alignment : 64 2025-05-07T20:23:12.0913783Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.0914104Z power management: 2025-05-07T20:23:12.0914247Z 2025-05-07T20:23:12.0914335Z processor : 3 2025-05-07T20:23:12.0914558Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.0914806Z cpu family : 23 2025-05-07T20:23:12.0915010Z model : 49 2025-05-07T20:23:12.0915221Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.0915467Z stepping : 0 2025-05-07T20:23:12.0915674Z microcode : 0x830107f 2025-05-07T20:23:12.0915913Z cpu MHz : 3299.548 2025-05-07T20:23:12.0916132Z cache size : 512 KB 2025-05-07T20:23:12.0916347Z physical id : 0 2025-05-07T20:23:12.0916558Z siblings : 16 2025-05-07T20:23:12.0916763Z core id : 3 2025-05-07T20:23:12.0916963Z cpu cores : 8 2025-05-07T20:23:12.0917166Z apicid : 6 2025-05-07T20:23:12.0917369Z initial apicid : 6 2025-05-07T20:23:12.0917583Z fpu : yes 2025-05-07T20:23:12.0917803Z fpu_exception : yes 2025-05-07T20:23:12.0918022Z cpuid level : 13 2025-05-07T20:23:12.0918231Z wp : yes 2025-05-07T20:23:12.0920191Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb 
sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.0922550Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.0923058Z bogomips : 5600.00 2025-05-07T20:23:12.0923276Z TLB size : 3072 4K pages 2025-05-07T20:23:12.0923517Z clflush size : 64 2025-05-07T20:23:12.0923735Z cache_alignment : 64 2025-05-07T20:23:12.0924009Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.0924321Z power management: 2025-05-07T20:23:12.0924461Z 2025-05-07T20:23:12.0924546Z processor : 4 2025-05-07T20:23:12.0924761Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.0924996Z cpu family : 23 2025-05-07T20:23:12.0925204Z model : 49 2025-05-07T20:23:12.0925420Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.0925663Z stepping : 0 2025-05-07T20:23:12.0925876Z microcode : 0x830107f 2025-05-07T20:23:12.0926108Z cpu MHz : 3188.425 2025-05-07T20:23:12.0926311Z cache size : 512 KB 2025-05-07T20:23:12.0926522Z physical id : 0 2025-05-07T20:23:12.0926799Z siblings : 16 2025-05-07T20:23:12.0941343Z core id : 4 2025-05-07T20:23:12.0941676Z cpu cores : 8 2025-05-07T20:23:12.0941965Z apicid : 8 2025-05-07T20:23:12.0942328Z initial apicid : 8 2025-05-07T20:23:12.0942554Z fpu : yes 2025-05-07T20:23:12.0942806Z fpu_exception : yes 2025-05-07T20:23:12.0943035Z cpuid level : 13 2025-05-07T20:23:12.0943249Z wp : yes 2025-05-07T20:23:12.0945291Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.0947510Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.0947994Z bogomips : 5600.00 2025-05-07T20:23:12.0948230Z TLB size : 3072 4K pages 2025-05-07T20:23:12.0948472Z clflush size : 64 2025-05-07T20:23:12.0948685Z cache_alignment : 64 2025-05-07T20:23:12.0948963Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.0949285Z power management: 2025-05-07T20:23:12.0949425Z 2025-05-07T20:23:12.0949518Z processor : 5 2025-05-07T20:23:12.0949744Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.0949988Z cpu family : 23 2025-05-07T20:23:12.0950196Z model : 49 2025-05-07T20:23:12.0950411Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.0950667Z stepping : 0 2025-05-07T20:23:12.0950872Z microcode : 0x830107f 2025-05-07T20:23:12.0951100Z cpu MHz : 3261.166 2025-05-07T20:23:12.0951320Z cache size : 512 KB 2025-05-07T20:23:12.0951533Z physical id : 0 2025-05-07T20:23:12.0951750Z siblings : 16 2025-05-07T20:23:12.0951956Z core id : 5 2025-05-07T20:23:12.0952155Z cpu cores : 8 2025-05-07T20:23:12.0952362Z apicid : 10 2025-05-07T20:23:12.0952572Z initial apicid : 10 2025-05-07T20:23:12.0952784Z fpu : yes 2025-05-07T20:23:12.0952995Z fpu_exception : yes 2025-05-07T20:23:12.0953215Z cpuid level : 13 2025-05-07T20:23:12.0953427Z wp : yes 2025-05-07T20:23:12.0955361Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx 
mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.0957588Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.0958077Z bogomips : 5600.00 2025-05-07T20:23:12.0958300Z TLB size : 3072 4K pages 2025-05-07T20:23:12.0958532Z clflush size : 64 2025-05-07T20:23:12.0958761Z cache_alignment : 64 2025-05-07T20:23:12.0959037Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.0959347Z power management: 2025-05-07T20:23:12.0959487Z 2025-05-07T20:23:12.0959572Z processor : 6 2025-05-07T20:23:12.0959792Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.0960028Z cpu family : 23 2025-05-07T20:23:12.0960239Z model : 49 2025-05-07T20:23:12.0960450Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.0960704Z stepping : 0 2025-05-07T20:23:12.0960911Z microcode : 0x830107f 2025-05-07T20:23:12.0961149Z cpu MHz : 3198.001 2025-05-07T20:23:12.0961368Z cache size : 512 KB 2025-05-07T20:23:12.0961585Z physical id : 0 2025-05-07T20:23:12.0961800Z siblings : 16 2025-05-07T20:23:12.0962007Z core id : 6 2025-05-07T20:23:12.0962212Z cpu cores : 8 2025-05-07T20:23:12.0962423Z apicid : 12 2025-05-07T20:23:12.0962635Z initial apicid : 12 2025-05-07T20:23:12.0962845Z fpu : yes 2025-05-07T20:23:12.0963059Z fpu_exception : yes 2025-05-07T20:23:12.0963308Z cpuid level : 13 2025-05-07T20:23:12.0963630Z wp : yes 2025-05-07T20:23:12.0965668Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.0968000Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.0968492Z bogomips : 5600.00 2025-05-07T20:23:12.0968712Z TLB size : 3072 4K pages 2025-05-07T20:23:12.0968954Z clflush size : 64 2025-05-07T20:23:12.0969173Z cache_alignment : 64 2025-05-07T20:23:12.0969447Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.0969750Z power management: 2025-05-07T20:23:12.0969888Z 2025-05-07T20:23:12.0969975Z processor : 7 2025-05-07T20:23:12.0970199Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.0970432Z cpu family : 23 2025-05-07T20:23:12.0970643Z model : 49 2025-05-07T20:23:12.0970853Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.0971091Z stepping : 0 2025-05-07T20:23:12.0971303Z microcode : 0x830107f 2025-05-07T20:23:12.0971532Z cpu MHz : 3316.480 2025-05-07T20:23:12.0971747Z cache size : 512 KB 2025-05-07T20:23:12.0971975Z physical id : 0 2025-05-07T20:23:12.0972191Z siblings : 16 2025-05-07T20:23:12.0972392Z core id : 7 2025-05-07T20:23:12.0972601Z cpu cores : 8 2025-05-07T20:23:12.0972809Z apicid : 
14 2025-05-07T20:23:12.0973009Z initial apicid : 14 2025-05-07T20:23:12.0973225Z fpu : yes 2025-05-07T20:23:12.0973429Z fpu_exception : yes 2025-05-07T20:23:12.0973643Z cpuid level : 13 2025-05-07T20:23:12.0973857Z wp : yes 2025-05-07T20:23:12.0975813Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.0978044Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.0978536Z bogomips : 5600.00 2025-05-07T20:23:12.0978755Z TLB size : 3072 4K pages 2025-05-07T20:23:12.0978997Z clflush size : 64 2025-05-07T20:23:12.0979216Z cache_alignment : 64 2025-05-07T20:23:12.0979484Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.0979807Z power management: 2025-05-07T20:23:12.0979938Z 2025-05-07T20:23:12.0980034Z processor : 8 2025-05-07T20:23:12.0980251Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.0980491Z cpu family : 23 2025-05-07T20:23:12.0980703Z model : 49 2025-05-07T20:23:12.0980902Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.0981150Z stepping : 0 2025-05-07T20:23:12.0981366Z microcode : 0x830107f 2025-05-07T20:23:12.0981588Z cpu MHz : 2823.868 2025-05-07T20:23:12.0981813Z cache size : 512 KB 2025-05-07T20:23:12.0982034Z physical id : 0 2025-05-07T20:23:12.0982247Z siblings : 16 2025-05-07T20:23:12.0982452Z core id : 0 2025-05-07T20:23:12.0982650Z cpu cores : 8 2025-05-07T20:23:12.0982850Z apicid : 1 2025-05-07T20:23:12.0983036Z initial apicid : 1 2025-05-07T20:23:12.0983240Z fpu : yes 2025-05-07T20:23:12.0983431Z fpu_exception : yes 2025-05-07T20:23:12.0983637Z cpuid level : 13 2025-05-07T20:23:12.0983837Z wp : yes 2025-05-07T20:23:12.0985769Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.0988297Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.0988775Z bogomips : 5600.00 2025-05-07T20:23:12.0988988Z TLB size : 3072 4K pages 2025-05-07T20:23:12.0989215Z clflush size : 64 2025-05-07T20:23:12.0989426Z cache_alignment : 64 2025-05-07T20:23:12.0989684Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.0989998Z power management: 2025-05-07T20:23:12.0990128Z 2025-05-07T20:23:12.0990224Z processor : 9 2025-05-07T20:23:12.0990430Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.0990660Z cpu family : 23 2025-05-07T20:23:12.0990868Z model : 49 2025-05-07T20:23:12.0991066Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.0991301Z 
stepping : 0 2025-05-07T20:23:12.0991503Z microcode : 0x830107f 2025-05-07T20:23:12.0991715Z cpu MHz : 3308.647 2025-05-07T20:23:12.0991928Z cache size : 512 KB 2025-05-07T20:23:12.0992143Z physical id : 0 2025-05-07T20:23:12.0992348Z siblings : 16 2025-05-07T20:23:12.0992598Z core id : 1 2025-05-07T20:23:12.0992806Z cpu cores : 8 2025-05-07T20:23:12.0993028Z apicid : 3 2025-05-07T20:23:12.0993251Z initial apicid : 3 2025-05-07T20:23:12.0993464Z fpu : yes 2025-05-07T20:23:12.0993658Z fpu_exception : yes 2025-05-07T20:23:12.0993876Z cpuid level : 13 2025-05-07T20:23:12.0994085Z wp : yes 2025-05-07T20:23:12.0996028Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.0998293Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.0998773Z bogomips : 5600.00 2025-05-07T20:23:12.0998997Z TLB size : 3072 4K pages 2025-05-07T20:23:12.0999231Z clflush size : 64 2025-05-07T20:23:12.0999444Z cache_alignment : 64 2025-05-07T20:23:12.0999712Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.1000024Z power management: 2025-05-07T20:23:12.1000155Z 2025-05-07T20:23:12.1000239Z processor : 10 2025-05-07T20:23:12.1000456Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.1000702Z cpu family : 23 2025-05-07T20:23:12.1000901Z model : 49 2025-05-07T20:23:12.1001109Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.1001355Z stepping : 0 2025-05-07T20:23:12.1001556Z microcode : 0x830107f 2025-05-07T20:23:12.1001781Z cpu MHz : 2905.656 2025-05-07T20:23:12.1001995Z cache size : 512 KB 2025-05-07T20:23:12.1002217Z physical id : 0 2025-05-07T20:23:12.1002420Z siblings : 16 2025-05-07T20:23:12.1002624Z core id : 2 2025-05-07T20:23:12.1002828Z cpu cores : 8 2025-05-07T20:23:12.1003031Z apicid : 5 2025-05-07T20:23:12.1003237Z initial apicid : 5 2025-05-07T20:23:12.1003448Z fpu : yes 2025-05-07T20:23:12.1003645Z fpu_exception : yes 2025-05-07T20:23:12.1003865Z cpuid level : 13 2025-05-07T20:23:12.1004075Z wp : yes 2025-05-07T20:23:12.1006871Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.1009627Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.1010316Z bogomips : 5600.00 2025-05-07T20:23:12.1010790Z TLB size : 3072 4K pages 2025-05-07T20:23:12.1011113Z clflush size : 64 2025-05-07T20:23:12.1011353Z cache_alignment : 64 2025-05-07T20:23:12.1011620Z address sizes : 48 bits 
physical, 48 bits virtual 2025-05-07T20:23:12.1011938Z power management: 2025-05-07T20:23:12.1012069Z 2025-05-07T20:23:12.1012152Z processor : 11 2025-05-07T20:23:12.1012372Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.1012611Z cpu family : 23 2025-05-07T20:23:12.1012811Z model : 49 2025-05-07T20:23:12.1013046Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.1013321Z stepping : 0 2025-05-07T20:23:12.1013524Z microcode : 0x830107f 2025-05-07T20:23:12.1013750Z cpu MHz : 3294.868 2025-05-07T20:23:12.1013965Z cache size : 512 KB 2025-05-07T20:23:12.1014174Z physical id : 0 2025-05-07T20:23:12.1014383Z siblings : 16 2025-05-07T20:23:12.1014585Z core id : 3 2025-05-07T20:23:12.1014781Z cpu cores : 8 2025-05-07T20:23:12.1014982Z apicid : 7 2025-05-07T20:23:12.1015181Z initial apicid : 7 2025-05-07T20:23:12.1015395Z fpu : yes 2025-05-07T20:23:12.1015592Z fpu_exception : yes 2025-05-07T20:23:12.1015810Z cpuid level : 13 2025-05-07T20:23:12.1016011Z wp : yes 2025-05-07T20:23:12.1017954Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.1020181Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.1020862Z bogomips : 5600.00 2025-05-07T20:23:12.1021161Z TLB size : 3072 4K pages 2025-05-07T20:23:12.1021467Z clflush size : 64 2025-05-07T20:23:12.1021681Z cache_alignment : 64 2025-05-07T20:23:12.1021951Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.1022315Z power management: 2025-05-07T20:23:12.1022509Z 2025-05-07T20:23:12.1022626Z processor : 12 2025-05-07T20:23:12.1022924Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.1023239Z cpu family : 23 2025-05-07T20:23:12.1023522Z model : 49 2025-05-07T20:23:12.1023733Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.1023976Z stepping : 0 2025-05-07T20:23:12.1024187Z microcode : 0x830107f 2025-05-07T20:23:12.1024411Z cpu MHz : 2057.325 2025-05-07T20:23:12.1024621Z cache size : 512 KB 2025-05-07T20:23:12.1024836Z physical id : 0 2025-05-07T20:23:12.1025046Z siblings : 16 2025-05-07T20:23:12.1025243Z core id : 4 2025-05-07T20:23:12.1025443Z cpu cores : 8 2025-05-07T20:23:12.1025645Z apicid : 9 2025-05-07T20:23:12.1025838Z initial apicid : 9 2025-05-07T20:23:12.1026044Z fpu : yes 2025-05-07T20:23:12.1026239Z fpu_exception : yes 2025-05-07T20:23:12.1026454Z cpuid level : 13 2025-05-07T20:23:12.1026662Z wp : yes 2025-05-07T20:23:12.1028607Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 
2025-05-07T20:23:12.1030937Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.1031420Z bogomips : 5600.00 2025-05-07T20:23:12.1031634Z TLB size : 3072 4K pages 2025-05-07T20:23:12.1031873Z clflush size : 64 2025-05-07T20:23:12.1032088Z cache_alignment : 64 2025-05-07T20:23:12.1032430Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.1032858Z power management: 2025-05-07T20:23:12.1033040Z 2025-05-07T20:23:12.1033164Z processor : 13 2025-05-07T20:23:12.1033449Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.1033763Z cpu family : 23 2025-05-07T20:23:12.1033972Z model : 49 2025-05-07T20:23:12.1034172Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.1034415Z stepping : 0 2025-05-07T20:23:12.1034677Z microcode : 0x830107f 2025-05-07T20:23:12.1035001Z cpu MHz : 3281.511 2025-05-07T20:23:12.1035286Z cache size : 512 KB 2025-05-07T20:23:12.1035577Z physical id : 0 2025-05-07T20:23:12.1035820Z siblings : 16 2025-05-07T20:23:12.1036014Z core id : 5 2025-05-07T20:23:12.1036215Z cpu cores : 8 2025-05-07T20:23:12.1036418Z apicid : 11 2025-05-07T20:23:12.1036619Z initial apicid : 11 2025-05-07T20:23:12.1036836Z fpu : yes 2025-05-07T20:23:12.1037037Z fpu_exception : yes 2025-05-07T20:23:12.1037247Z cpuid level : 13 2025-05-07T20:23:12.1037455Z wp : yes 2025-05-07T20:23:12.1039414Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.1041650Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.1042128Z bogomips : 5600.00 2025-05-07T20:23:12.1042348Z TLB size : 3072 4K pages 2025-05-07T20:23:12.1042584Z clflush size : 64 2025-05-07T20:23:12.1042797Z cache_alignment : 64 2025-05-07T20:23:12.1043093Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.1043438Z power management: 2025-05-07T20:23:12.1043568Z 2025-05-07T20:23:12.1043659Z processor : 14 2025-05-07T20:23:12.1043875Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.1044115Z cpu family : 23 2025-05-07T20:23:12.1044320Z model : 49 2025-05-07T20:23:12.1044521Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.1044764Z stepping : 0 2025-05-07T20:23:12.1044970Z microcode : 0x830107f 2025-05-07T20:23:12.1045191Z cpu MHz : 3297.405 2025-05-07T20:23:12.1045414Z cache size : 512 KB 2025-05-07T20:23:12.1045632Z physical id : 0 2025-05-07T20:23:12.1045836Z siblings : 16 2025-05-07T20:23:12.1046036Z core id : 6 2025-05-07T20:23:12.1046237Z cpu cores : 8 2025-05-07T20:23:12.1046436Z apicid : 13 2025-05-07T20:23:12.1046638Z initial apicid : 13 2025-05-07T20:23:12.1046857Z fpu : yes 2025-05-07T20:23:12.1047125Z fpu_exception : yes 2025-05-07T20:23:12.1047427Z cpuid level : 13 2025-05-07T20:23:12.1047794Z wp : yes 2025-05-07T20:23:12.1050148Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid 
extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.1052525Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.1053027Z bogomips : 5600.00 2025-05-07T20:23:12.1053278Z TLB size : 3072 4K pages 2025-05-07T20:23:12.1053509Z clflush size : 64 2025-05-07T20:23:12.1053718Z cache_alignment : 64 2025-05-07T20:23:12.1053985Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.1054300Z power management: 2025-05-07T20:23:12.1054431Z 2025-05-07T20:23:12.1054607Z processor : 15 2025-05-07T20:23:12.1054826Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.1055065Z cpu family : 23 2025-05-07T20:23:12.1055269Z model : 49 2025-05-07T20:23:12.1055476Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.1055718Z stepping : 0 2025-05-07T20:23:12.1055922Z microcode : 0x830107f 2025-05-07T20:23:12.1056147Z cpu MHz : 3249.562 2025-05-07T20:23:12.1056363Z cache size : 512 KB 2025-05-07T20:23:12.1056575Z physical id : 0 2025-05-07T20:23:12.1056793Z siblings : 16 2025-05-07T20:23:12.1056999Z core id : 7 2025-05-07T20:23:12.1057205Z cpu cores : 8 2025-05-07T20:23:12.1057404Z apicid : 15 2025-05-07T20:23:12.1057610Z initial apicid : 15 2025-05-07T20:23:12.1057826Z fpu : yes 2025-05-07T20:23:12.1058019Z fpu_exception : yes 2025-05-07T20:23:12.1058240Z cpuid level : 13 2025-05-07T20:23:12.1058452Z wp : yes 2025-05-07T20:23:12.1060430Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.1063335Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.1064020Z bogomips : 5600.00 2025-05-07T20:23:12.1064319Z TLB size : 3072 4K pages 2025-05-07T20:23:12.1064594Z clflush size : 64 2025-05-07T20:23:12.1064810Z cache_alignment : 64 2025-05-07T20:23:12.1065087Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.1065400Z power management: 2025-05-07T20:23:12.1065542Z 2025-05-07T20:23:12.1065546Z 2025-05-07T20:23:12.1065670Z ################################################################################ 2025-05-07T20:23:12.1065987Z [INFO] Print PCI info ... 2025-05-07T20:23:12.1066234Z + lspci -v 2025-05-07T20:23:12.1066350Z 2025-05-07T20:23:12.1066584Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:12.1066965Z Subsystem: Amazon.com, Inc. 
Device 1237 2025-05-07T20:23:12.1067287Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:12.1067495Z 2025-05-07T20:23:12.1067703Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:12.1068088Z Physical Slot: 1 2025-05-07T20:23:12.1068323Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.1068535Z 2025-05-07T20:23:12.1068782Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:12.1069216Z Physical Slot: 1 2025-05-07T20:23:12.1069469Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:12.1069700Z 2025-05-07T20:23:12.1069966Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:12.1070414Z Physical Slot: 3 2025-05-07T20:23:12.1070661Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.1071009Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:12.1071369Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:12.1071592Z 2025-05-07T20:23:12.1071894Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:12.1072507Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:12.1072800Z Physical Slot: 4 2025-05-07T20:23:12.1073059Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:12.1073437Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.1073793Z Capabilities: <access denied> 2025-05-07T20:23:12.1074064Z Kernel driver in use: nvme 2025-05-07T20:23:12.1074229Z 2025-05-07T20:23:12.1074567Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:12.1075043Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:12.1075391Z Physical Slot: 5 2025-05-07T20:23:12.1075637Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.1075990Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.1076383Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:12.1076716Z Capabilities: <access denied> 2025-05-07T20:23:12.1076985Z Kernel driver in use: ena 2025-05-07T20:23:12.1077234Z Kernel modules: ena 2025-05-07T20:23:12.1077374Z 2025-05-07T20:23:12.1077551Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:12.1077934Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:12.1078227Z Physical Slot: 30 2025-05-07T20:23:12.1078491Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:12.1078872Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:12.1079307Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:12.1079835Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:12.1080297Z Capabilities: <access denied> 2025-05-07T20:23:12.1080655Z Kernel driver in use: nvidia 2025-05-07T20:23:12.1080958Z Kernel modules: nvidia 2025-05-07T20:23:12.1081105Z 2025-05-07T20:23:12.1081416Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:12.1081935Z Subsystem: Amazon.com, Inc.
Device 0000 2025-05-07T20:23:12.1082226Z Physical Slot: 31 2025-05-07T20:23:12.1082473Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.1082835Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.1083212Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:12.1083547Z Capabilities: <access denied> 2025-05-07T20:23:12.1083817Z Kernel driver in use: nvme 2025-05-07T20:23:12.1083979Z 2025-05-07T20:23:12.1083983Z 2025-05-07T20:23:12.1084106Z ################################################################################ 2025-05-07T20:23:12.1084437Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:12.1084729Z + uname -a 2025-05-07T20:23:12.1084842Z 2025-05-07T20:23:12.1085260Z Linux ip-10-0-29-135.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:12.1085761Z 2025-05-07T20:23:12.1085845Z + uname -m 2025-05-07T20:23:12.1085969Z 2025-05-07T20:23:12.1086046Z x86_64 2025-05-07T20:23:12.1086154Z 2025-05-07T20:23:12.1086256Z + cat /proc/version 2025-05-07T20:23:12.1086390Z 2025-05-07T20:23:12.1086943Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:12.1087663Z 2025-05-07T20:23:12.1087751Z + cat /etc/os-release 2025-05-07T20:23:12.1087903Z 2025-05-07T20:23:12.1087995Z NAME="Amazon Linux" 2025-05-07T20:23:12.1088217Z VERSION="2023" 2025-05-07T20:23:12.1088421Z ID="amzn" 2025-05-07T20:23:12.1088623Z ID_LIKE="fedora" 2025-05-07T20:23:12.1096510Z VERSION_ID="2023" 2025-05-07T20:23:12.1096835Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:12.1097137Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:12.1097433Z ANSI_COLOR="0;33" 2025-05-07T20:23:12.1097689Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:12.1098225Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:12.1098667Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:12.1099086Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:12.1099536Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:12.1099918Z VENDOR_NAME="AWS" 2025-05-07T20:23:12.1100173Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:12.1100463Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:12.1100629Z 2025-05-07T20:23:12.1100844Z ################################################################################ 2025-05-07T20:23:12.1101165Z # Print EC2 Instance Info 2025-05-07T20:23:12.1101406Z # 2025-05-07T20:23:12.1101635Z # [2025-05-07T20:23:12.107Z] + print_ec2_info 2025-05-07T20:23:12.1101968Z ################################################################################ 2025-05-07T20:23:12.1102186Z 2025-05-07T20:23:12.1242355Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:12.1363555Z instance-id: i-0e49e9d70b38203df 2025-05-07T20:23:12.1481075Z instance-type: g5.4xlarge 2025-05-07T20:23:12.1532700Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:12.1533289Z .
$PRELUDE; print_gpu_info 2025-05-07T20:23:12.1546618Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:12.1547148Z env: 2025-05-07T20:23:12.1547475Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:12.1547946Z BUILD_ENV: build_binary 2025-05-07T20:23:12.1548340Z BUILD_TARGET: genai 2025-05-07T20:23:12.1548691Z BUILD_VARIANT: cuda 2025-05-07T20:23:12.1549060Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:12.1549464Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:12.1549938Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:12.1550450Z ##[endgroup] 2025-05-07T20:23:12.4863267Z ################################################################################ 2025-05-07T20:23:12.4863689Z [INFO] Printing general display info ... 2025-05-07T20:23:12.4890705Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:12.5993397Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:12.6001480Z /usr/bin/sudo 2025-05-07T20:23:12.6013315Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:12.6022707Z /usr/bin/yum 2025-05-07T20:23:12.6024289Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:12.6045233Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:13.0669826Z Last metadata expiration check: 0:00:09 ago on Wed May 7 20:23:04 2025. 2025-05-07T20:23:13.1439103Z ================================================================================ 2025-05-07T20:23:13.1439602Z WARNING: 2025-05-07T20:23:13.1439960Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:13.1440290Z 2025-05-07T20:23:13.1440414Z Available Versions: 2025-05-07T20:23:13.1440620Z 2025-05-07T20:23:13.1440742Z Version 2023.7.20250331: 2025-05-07T20:23:13.1441128Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:13.1441397Z 2025-05-07T20:23:13.1441534Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:13.1441743Z 2025-05-07T20:23:13.1441828Z Release notes: 2025-05-07T20:23:13.1442236Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:13.1442604Z 2025-05-07T20:23:13.1442701Z Version 2023.7.20250414: 2025-05-07T20:23:13.1443015Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:13.1443262Z 2025-05-07T20:23:13.1443401Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:13.1443636Z 2025-05-07T20:23:13.1443723Z Release notes: 2025-05-07T20:23:13.1444118Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:13.1444479Z 2025-05-07T20:23:13.1444568Z Version 2023.7.20250428: 2025-05-07T20:23:13.1444880Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:13.1445339Z 2025-05-07T20:23:13.1445451Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:13.1445660Z 2025-05-07T20:23:13.1445753Z Release notes: 2025-05-07T20:23:13.1446139Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:13.1446503Z 2025-05-07T20:23:13.1446615Z ================================================================================ 2025-05-07T20:23:13.2584203Z Dependencies resolved. 
2025-05-07T20:23:13.2871738Z ================================================================================ 2025-05-07T20:23:13.2872753Z Package Arch Version Repository Size 2025-05-07T20:23:13.2873683Z ================================================================================ 2025-05-07T20:23:13.2874063Z Upgrading: 2025-05-07T20:23:13.2874478Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:13.2875289Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:13.2875794Z 2025-05-07T20:23:13.2876181Z Transaction Summary 2025-05-07T20:23:13.2876511Z ================================================================================ 2025-05-07T20:23:13.2876812Z Upgrade 2 Packages 2025-05-07T20:23:13.2876953Z 2025-05-07T20:23:13.2877061Z Total download size: 6.9 M 2025-05-07T20:23:13.2877320Z Downloading Packages: 2025-05-07T20:23:13.3248931Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 34 MB/s | 1.2 MB 00:00 2025-05-07T20:23:13.3669582Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 73 MB/s | 5.7 MB 00:00 2025-05-07T20:23:13.3679956Z -------------------------------------------------------------------------------- 2025-05-07T20:23:13.3681197Z Total 86 MB/s | 6.9 MB 00:00 2025-05-07T20:23:13.3684335Z Running transaction check 2025-05-07T20:23:13.3783525Z Transaction check succeeded. 2025-05-07T20:23:13.3784155Z Running transaction test 2025-05-07T20:23:13.4079515Z Transaction test succeeded. 2025-05-07T20:23:13.4082324Z Running transaction 2025-05-07T20:23:13.9621198Z Preparing : 1/1 2025-05-07T20:23:14.0676000Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:14.0696256Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:14.0930817Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:14.0931583Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:14.1033532Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:14.1054466Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:14.2831516Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:14.2832094Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:14.2832658Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:14.2833189Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4
2025-05-07T20:23:14.4842873Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:14.4843220Z 2025-05-07T20:23:14.4843315Z Upgraded: 2025-05-07T20:23:14.4843667Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:14.4844280Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:14.4844613Z 2025-05-07T20:23:14.4844695Z Complete! 2025-05-07T20:23:14.5279559Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:14.5303893Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:15.0104356Z Last metadata expiration check: 0:00:10 ago on Wed May 7 20:23:04 2025. 2025-05-07T20:23:15.0345531Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:15.0743317Z Dependencies resolved.
2025-05-07T20:23:15.0920056Z ================================================================================ 2025-05-07T20:23:15.0920557Z Package Architecture Version Repository Size 2025-05-07T20:23:15.0920975Z ================================================================================ 2025-05-07T20:23:15.0921277Z Installing: 2025-05-07T20:23:15.0921568Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k 2025-05-07T20:23:15.0921833Z 2025-05-07T20:23:15.0921930Z Transaction Summary 2025-05-07T20:23:15.0922174Z ================================================================================ 2025-05-07T20:23:15.0922480Z Install 1 Package 2025-05-07T20:23:15.0922612Z 2025-05-07T20:23:15.0922938Z Total download size: 319 k 2025-05-07T20:23:15.0923221Z Installed size: 837 k 2025-05-07T20:23:15.0924499Z Downloading Packages: 2025-05-07T20:23:15.1706194Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 6.5 MB/s | 319 kB 00:00 2025-05-07T20:23:15.1711966Z -------------------------------------------------------------------------------- 2025-05-07T20:23:15.1714712Z Total 4.0 MB/s | 319 kB 00:00 2025-05-07T20:23:15.1869355Z Running transaction check 2025-05-07T20:23:15.1924236Z Transaction check succeeded. 2025-05-07T20:23:15.1924851Z Running transaction test 2025-05-07T20:23:15.2381871Z Transaction test succeeded. 2025-05-07T20:23:15.2385470Z Running transaction 2025-05-07T20:23:15.3404077Z Preparing : 1/1 2025-05-07T20:23:15.3909663Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:15.5389247Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1
2025-05-07T20:23:15.6986560Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:15.6986913Z 2025-05-07T20:23:15.6987004Z Installed: 2025-05-07T20:23:15.6987320Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64 2025-05-07T20:23:15.6987610Z 2025-05-07T20:23:15.6987704Z Complete! 2025-05-07T20:23:15.7428565Z + hostname 2025-05-07T20:23:15.7428726Z 2025-05-07T20:23:15.7442135Z ip-10-0-29-135.ec2.internal 2025-05-07T20:23:15.7443083Z 2025-05-07T20:23:15.7443568Z + sudo lshw -C display 2025-05-07T20:23:15.7443780Z 2025-05-07T20:23:16.3041306Z *-display:0 UNCLAIMED 2025-05-07T20:23:16.3041662Z description: VGA compatible controller 2025-05-07T20:23:16.3042000Z product: Amazon.com, Inc. 2025-05-07T20:23:16.3042285Z vendor: Amazon.com, Inc.
2025-05-07T20:23:16.3042551Z physical id: 3 2025-05-07T20:23:16.3042790Z bus info: pci@0000:00:03.0 2025-05-07T20:23:16.3043055Z version: 00 2025-05-07T20:23:16.3043278Z width: 32 bits 2025-05-07T20:23:16.3043503Z clock: 33MHz 2025-05-07T20:23:16.3043760Z capabilities: vga_controller bus_master 2025-05-07T20:23:16.3044082Z configuration: latency=0 2025-05-07T20:23:16.3044410Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:16.3044739Z *-display:1 2025-05-07T20:23:16.3044997Z description: 3D controller 2025-05-07T20:23:16.3045290Z product: GA102GL [A10G] 2025-05-07T20:23:16.3045554Z vendor: NVIDIA Corporation 2025-05-07T20:23:16.3045823Z physical id: 1e 2025-05-07T20:23:16.3046072Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:16.3046324Z version: a1 2025-05-07T20:23:16.3046544Z width: 64 bits 2025-05-07T20:23:16.3046765Z clock: 33MHz 2025-05-07T20:23:16.3047049Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:16.3047420Z configuration: driver=nvidia latency=0 2025-05-07T20:23:16.3048135Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:16.3079330Z 2025-05-07T20:23:16.3079798Z ################################################################################ 2025-05-07T20:23:16.3208016Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:16.3208398Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:16.3376081Z Wed May 7 20:23:16 2025 2025-05-07T20:23:16.3376830Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.3377788Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:16.3378732Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:16.3379690Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:16.3380710Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:16.3381531Z | | | MIG M. | 2025-05-07T20:23:16.3382180Z |=========================================+========================+======================| 2025-05-07T20:23:16.3454812Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:16.3455468Z | 0% 30C P0 57W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:16.3455860Z | | | N/A | 2025-05-07T20:23:16.3456256Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:16.3456648Z 2025-05-07T20:23:16.3457044Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.3457461Z | Processes: | 2025-05-07T20:23:16.3457896Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:16.3458309Z | ID ID Usage | 2025-05-07T20:23:16.3458661Z |=========================================================================================| 2025-05-07T20:23:16.3459575Z | No running processes found | 2025-05-07T20:23:16.3460039Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.4850021Z ################################################################################ 2025-05-07T20:23:16.4850378Z [INFO] Printing AMD GPU info ... 
2025-05-07T20:23:16.4991811Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:16.4992459Z [CHECK] rocminfo not found 2025-05-07T20:23:16.5000290Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:16.5002466Z [CHECK] rocm-smi not found 2025-05-07T20:23:16.5066651Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:16.5067080Z . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:16.5078939Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:16.5079287Z env: 2025-05-07T20:23:16.5079518Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:16.5079820Z BUILD_ENV: build_binary 2025-05-07T20:23:16.5080070Z BUILD_TARGET: genai 2025-05-07T20:23:16.5080304Z BUILD_VARIANT: cuda 2025-05-07T20:23:16.5080537Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:16.5080798Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:16.5081099Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:16.5081434Z ##[endgroup] 2025-05-07T20:23:16.8402599Z ################################################################################ 2025-05-07T20:23:16.8402976Z # Setup Miniconda 2025-05-07T20:23:16.8403258Z # 2025-05-07T20:23:16.8417539Z # [2025-05-07T20:23:16.841Z] + setup_miniconda /home/ec2-user/miniconda 2025-05-07T20:23:16.8418102Z ################################################################################ 2025-05-07T20:23:16.8418405Z 2025-05-07T20:23:16.8431916Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:16.9398662Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:16.9399633Z + mkdir -p /home/ec2-user/miniconda 2025-05-07T20:23:16.9400176Z 2025-05-07T20:23:16.9416160Z 2025-05-07T20:23:16.9416664Z [SETUP] Downloading the Miniconda installer ... 2025-05-07T20:23:16.9437313Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh 2025-05-07T20:23:17.9807991Z [SETUP] Installing Miniconda ... 2025-05-07T20:23:17.9808351Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u 2025-05-07T20:23:17.9808604Z 2025-05-07T20:23:17.9952496Z PREFIX=/home/ec2-user/miniconda 2025-05-07T20:23:18.4401761Z Unpacking payload ... 2025-05-07T20:23:18.9576374Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:19.7539011Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:21.8549942Z 2025-05-07T20:23:21.8550398Z Installing base environment... 2025-05-07T20:23:21.8550634Z 2025-05-07T20:23:22.9340877Z Preparing transaction: ...working... done 2025-05-07T20:23:25.7917388Z Executing transaction: ...working... done 2025-05-07T20:23:26.4502269Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:26.5381877Z installation finished. 2025-05-07T20:23:26.5389507Z 2025-05-07T20:23:26.5389743Z + rm -f miniconda.sh 2025-05-07T20:23:26.5389980Z 2025-05-07T20:23:26.5697060Z 2025-05-07T20:23:26.5697608Z [SETUP] Reloading the bash configuration ... 
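The Miniconda bootstrap above reduces to a standard silent install; a rough equivalent, using the same installer URL and prefix that appear in the log:

    #!/usr/bin/env bash
    # Sketch of the silent Miniconda install performed above.
    PREFIX="$HOME/miniconda"

    wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
    bash miniconda.sh -b -p "$PREFIX" -u   # -b = batch (no prompts), -u = update an existing prefix
    rm -f miniconda.sh
    "$PREFIX/bin/conda" init bash          # appends the conda hook to ~/.bashrc
    . ~/.bashrc                            # reload so conda resolves in the current shell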
2025-05-07T20:23:26.5698093Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:23:26.9319175Z no change /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:23:26.9319563Z no change /home/ec2-user/miniconda/bin/conda
2025-05-07T20:23:26.9319980Z no change /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:23:26.9320425Z no change /home/ec2-user/miniconda/bin/activate
2025-05-07T20:23:26.9320783Z no change /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:23:26.9321175Z no change /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:23:26.9321596Z no change /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:23:26.9322037Z no change /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:23:26.9322867Z no change /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:23:26.9323650Z no change /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:23:26.9324165Z no change /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:23:26.9324534Z modified /home/ec2-user/.bashrc
2025-05-07T20:23:26.9324921Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:23:26.9973939Z + . /home/ec2-user/.bashrc
2025-05-07T20:23:27.8285916Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:23:27.8310963Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:23:41.7114854Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:23:43.2974971Z Solving environment: done
2025-05-07T20:23:43.4026801Z ## Package Plan ##
2025-05-07T20:23:43.4027615Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:23:43.4028533Z   added / updated specs:
2025-05-07T20:23:43.4029191Z     - conda-libmamba-solver
2025-05-07T20:23:43.4029717Z     - libarchive
2025-05-07T20:23:43.4030158Z     - libmamba
2025-05-07T20:23:43.4030574Z     - libmambapy
2025-05-07T20:23:43.4031120Z The following packages will be downloaded:
2025-05-07T20:23:43.4031791Z     package                     |            build
2025-05-07T20:23:43.4032287Z     ----------------------------|-----------------
2025-05-07T20:23:43.4032828Z     ca-certificates-2025.4.26   |       hbd8a1cb_0         149 KB  conda-forge
2025-05-07T20:23:43.4033305Z     certifi-2025.4.26           |     pyhd8ed1ab_0         154 KB  conda-forge
2025-05-07T20:23:43.4033739Z     conda-25.3.1                |  py313h78bf25f_1         1.1 MB  conda-forge
2025-05-07T20:23:43.4034371Z     conda-libmamba-solver-25.4.0|     pyhd8ed1ab_0          41 KB  conda-forge
2025-05-07T20:23:43.4035052Z     ------------------------------------------------------------
2025-05-07T20:23:43.4035557Z                                            Total:         1.4 MB
2025-05-07T20:23:43.4036051Z The following packages will be UPDATED:
2025-05-07T20:23:43.4041696Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:23:43.4042746Z   conda              pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:23:43.4043669Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:23:43.4044720Z   certifi            pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:23:43.4046046Z   conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:23:43.4047101Z Downloading and Extracting Packages: ...working...
2025-05-07T20:23:43.4724738Z ca-certificates-2025 | 149 KB    | ########## | 100%
2025-05-07T20:23:43.4765675Z conda-25.3.1         | 1.1 MB    | ########## | 100%
2025-05-07T20:23:43.5141106Z conda-libmamba-solve | 41 KB     | ########## | 100%
2025-05-07T20:23:43.5145864Z certifi-2025.4.26    | 154 KB    | ########## | 100%
2025-05-07T20:23:43.6166074Z done
2025-05-07T20:23:43.7169233Z Preparing transaction: done
2025-05-07T20:23:43.8174098Z Verifying transaction: done
2025-05-07T20:23:45.1194536Z Executing transaction: done
2025-05-07T20:23:46.8417605Z [SETUP] Updating Miniconda base packages ...
2025-05-07T20:23:46.8442273Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:23:47.7875839Z Channels:
2025-05-07T20:23:47.7876179Z  - defaults
2025-05-07T20:23:47.7876496Z Platform: linux-64
2025-05-07T20:23:49.0563550Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:49.1816038Z Solving environment: Channels:
2025-05-07T20:23:49.1816447Z  - defaults
2025-05-07T20:23:49.1816709Z Platform: linux-64
2025-05-07T20:23:49.4964432Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:49.7121593Z Solving environment: done
2025-05-07T20:23:49.7896818Z done
2025-05-07T20:23:49.8600871Z ## Package Plan ##
2025-05-07T20:23:49.8601278Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:23:49.8601757Z   added / updated specs:
2025-05-07T20:23:49.8602074Z     - conda
2025-05-07T20:23:49.8602374Z The following packages will be downloaded:
2025-05-07T20:23:49.8602704Z     package                    |            build
2025-05-07T20:23:49.8603019Z     ---------------------------|-----------------
2025-05-07T20:23:49.8603353Z     pip-25.1                   |     pyhc872135_2         1.3 MB
2025-05-07T20:23:49.8604157Z     tzdata-2025b               |       h04d1e81_0         116 KB
2025-05-07T20:23:49.8604684Z     ------------------------------------------------------------
2025-05-07T20:23:49.8605088Z                                            Total:         1.4 MB
2025-05-07T20:23:49.8605422Z The following packages will be UPDATED:
2025-05-07T20:23:49.8606229Z   pip       pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:23:49.8606737Z   tzdata                        2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:23:49.8607141Z Downloading and Extracting Packages: ...working...
2025-05-07T20:23:49.9127675Z tzdata-2025b         | 116 KB    | ########## | 100%
2025-05-07T20:23:50.1776608Z pip-25.1             | 1.3 MB    | ########## | 100%
2025-05-07T20:23:50.1917082Z done
2025-05-07T20:23:50.2922018Z Preparing transaction: done
2025-05-07T20:23:50.3927378Z Verifying transaction: done
2025-05-07T20:23:52.6955883Z Executing transaction: done
2025-05-07T20:23:53.3027933Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:23:53.3032365Z + conda clean --packages --tarball -y
2025-05-07T20:23:54.3103597Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:23:54.3104047Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:23:54.3739892Z + conda clean --all -y
2025-05-07T20:23:54.9054132Z There are no unused tarball(s) to remove.
2025-05-07T20:23:54.9054587Z Will remove 1 index cache(s).
2025-05-07T20:23:54.9054965Z There are no unused package(s) to remove.
2025-05-07T20:23:54.9055340Z There are no tempfile(s) to remove. 2025-05-07T20:23:54.9055638Z There are no logfile(s) to remove. 2025-05-07T20:23:54.9676698Z 2025-05-07T20:23:54.9681137Z + conda info 2025-05-07T20:23:54.9681332Z 2025-05-07T20:23:55.7161960Z 2025-05-07T20:23:55.7162703Z active environment : base 2025-05-07T20:23:55.7163230Z active env location : /home/ec2-user/miniconda 2025-05-07T20:23:55.7163629Z shell level : 1 2025-05-07T20:23:55.7163905Z user config file : /home/ec2-user/.condarc 2025-05-07T20:23:55.7164293Z populated config files : /home/ec2-user/miniconda/.condarc 2025-05-07T20:23:55.7164664Z conda version : 25.3.1 2025-05-07T20:23:55.7164941Z conda-build version : not installed 2025-05-07T20:23:55.7165246Z python version : 3.13.2.final.0 2025-05-07T20:23:55.7165544Z solver : libmamba (default) 2025-05-07T20:23:55.7165893Z virtual packages : __archspec=1=zen2 2025-05-07T20:23:55.7166189Z __conda=25.3.1=0 2025-05-07T20:23:55.7166463Z __cuda=12.8=0 2025-05-07T20:23:55.7166726Z __glibc=2.34=0 2025-05-07T20:23:55.7167001Z __linux=6.1.130=0 2025-05-07T20:23:55.7167270Z __unix=0=0 2025-05-07T20:23:55.7168021Z base environment : /home/ec2-user/miniconda (writable) 2025-05-07T20:23:55.7168436Z conda av data dir : /home/ec2-user/miniconda/etc/conda 2025-05-07T20:23:55.7168788Z conda av metadata url : None 2025-05-07T20:23:55.7169159Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64 2025-05-07T20:23:55.7169583Z https://repo.anaconda.com/pkgs/main/noarch 2025-05-07T20:23:55.7169964Z https://repo.anaconda.com/pkgs/r/linux-64 2025-05-07T20:23:55.7170342Z https://repo.anaconda.com/pkgs/r/noarch 2025-05-07T20:23:55.7170704Z package cache : /home/ec2-user/miniconda/pkgs 2025-05-07T20:23:55.7171040Z /home/ec2-user/.conda/pkgs 2025-05-07T20:23:55.7171383Z envs directories : /home/ec2-user/miniconda/envs 2025-05-07T20:23:55.7171713Z /home/ec2-user/.conda/envs 2025-05-07T20:23:55.7172017Z platform : linux-64 2025-05-07T20:23:55.7172852Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/. 2025-05-07T20:23:55.7173826Z UID:GID : 1000:1000 2025-05-07T20:23:55.7174097Z netrc file : None 2025-05-07T20:23:55.7174358Z offline mode : False 2025-05-07T20:23:55.7174523Z 2025-05-07T20:23:55.7814399Z 2025-05-07T20:23:55.7814749Z [SETUP] Exporting Miniconda variables ... 2025-05-07T20:23:55.7815930Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_fcaa5f2a-b6b0-49f7-919d-c15ce02ab8c2 ... 2025-05-07T20:23:55.7816713Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda 2025-05-07T20:23:55.7888715Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.11 2025-05-07T20:23:55.7889226Z . 
$PRELUDE; create_conda_environment $BUILD_ENV 3.11 2025-05-07T20:23:55.7908315Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:55.7908655Z env: 2025-05-07T20:23:55.7908876Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:55.7909178Z BUILD_ENV: build_binary 2025-05-07T20:23:55.7909438Z BUILD_TARGET: genai 2025-05-07T20:23:55.7909666Z BUILD_VARIANT: cuda 2025-05-07T20:23:55.7909891Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:55.7910143Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:55.7910437Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:55.7910759Z ##[endgroup] 2025-05-07T20:23:56.1272818Z ################################################################################ 2025-05-07T20:23:56.1273189Z # Create Conda Environment 2025-05-07T20:23:56.1273435Z # 2025-05-07T20:23:56.1289909Z # [2025-05-07T20:23:56.128Z] + create_conda_environment build_binary 3.11 2025-05-07T20:23:56.1290320Z ################################################################################ 2025-05-07T20:23:56.1290548Z 2025-05-07T20:23:56.1308025Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:56.2185934Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:56.2186305Z [SETUP] Listing existing Conda environments ... 2025-05-07T20:23:56.2186626Z + conda info --envs 2025-05-07T20:23:56.2186769Z 2025-05-07T20:23:56.9625536Z 2025-05-07T20:23:56.9626093Z # conda environments: 2025-05-07T20:23:56.9626368Z # 2025-05-07T20:23:56.9626592Z base /home/ec2-user/miniconda 2025-05-07T20:23:56.9626827Z 2025-05-07T20:23:57.0280252Z 2025-05-07T20:23:57.0281457Z [SETUP] Deleting the prefix directory if it exists ... 2025-05-07T20:23:58.6515675Z + rm -rf /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:23:58.6515942Z 2025-05-07T20:23:58.6529228Z 2025-05-07T20:23:58.6538384Z [SETUP] Creating new Conda environment (Python 3.11) ... 
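The sequence above (list environments, delete any stale prefix, then create) makes environment creation idempotent across job retries; the same pattern in isolation, with the names taken from this log:

    # Recreate-from-scratch pattern used by create_conda_environment above.
    ENV_NAME=build_binary
    conda info --envs                              # show what already exists
    rm -rf "$HOME/miniconda/envs/$ENV_NAME"        # drop any stale prefix first
    conda create -y -n "$ENV_NAME" python=3.11     # then create cleanly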
2025-05-07T20:23:58.6561181Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.11 2025-05-07T20:23:59.4094589Z Channels: 2025-05-07T20:23:59.4094842Z - defaults 2025-05-07T20:23:59.4095115Z Platform: linux-64 2025-05-07T20:24:00.9778419Z Collecting package metadata (repodata.json): - \ | / - \ | / - \ | done 2025-05-07T20:24:01.1018179Z Solving environment: - done 2025-05-07T20:24:01.1308823Z 2025-05-07T20:24:01.1309057Z ## Package Plan ## 2025-05-07T20:24:01.1309279Z 2025-05-07T20:24:01.1309577Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:01.1309978Z 2025-05-07T20:24:01.1310105Z added / updated specs: 2025-05-07T20:24:01.1310425Z - python=3.11 2025-05-07T20:24:01.1310557Z 2025-05-07T20:24:01.1310561Z 2025-05-07T20:24:01.1310689Z The following packages will be downloaded: 2025-05-07T20:24:01.1310904Z 2025-05-07T20:24:01.1311041Z package | build 2025-05-07T20:24:01.1311357Z ---------------------------|----------------- 2025-05-07T20:24:01.1311715Z _libgcc_mutex-0.1 | main 3 KB 2025-05-07T20:24:01.1312118Z _openmp_mutex-5.1 | 1_gnu 21 KB 2025-05-07T20:24:01.1312689Z ca-certificates-2025.2.25 | h06a4308_0 129 KB 2025-05-07T20:24:01.1313236Z python-3.11.11 | he870216_0 32.9 MB 2025-05-07T20:24:01.1314092Z setuptools-78.1.1 | py311h06a4308_0 2.3 MB 2025-05-07T20:24:01.1314490Z wheel-0.45.1 | py311h06a4308_0 151 KB 2025-05-07T20:24:01.1314850Z ------------------------------------------------------------ 2025-05-07T20:24:01.1315190Z Total: 35.4 MB 2025-05-07T20:24:01.1315395Z 2025-05-07T20:24:01.1315531Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:01.1315753Z 2025-05-07T20:24:01.1316148Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main 2025-05-07T20:24:01.1316602Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu 2025-05-07T20:24:01.1317015Z bzip2 pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6 2025-05-07T20:24:01.1317514Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0 2025-05-07T20:24:01.1318063Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0 2025-05-07T20:24:01.1318524Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1 2025-05-07T20:24:01.1318948Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1 2025-05-07T20:24:01.1319465Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1 2025-05-07T20:24:01.1320092Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1 2025-05-07T20:24:01.1320714Z libuuid pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0 2025-05-07T20:24:01.1321149Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0 2025-05-07T20:24:01.1321568Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0 2025-05-07T20:24:01.1321968Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2 2025-05-07T20:24:01.1322367Z python pkgs/main/linux-64::python-3.11.11-he870216_0 2025-05-07T20:24:01.1322800Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0 2025-05-07T20:24:01.1323298Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py311h06a4308_0 2025-05-07T20:24:01.1323833Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0 2025-05-07T20:24:01.1324224Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0 2025-05-07T20:24:01.1324607Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0 2025-05-07T20:24:01.1325016Z wheel pkgs/main/linux-64::wheel-0.45.1-py311h06a4308_0 2025-05-07T20:24:01.1325415Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1 2025-05-07T20:24:01.1325794Z zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1 2025-05-07T20:24:01.1326063Z 2025-05-07T20:24:01.1326070Z 
2025-05-07T20:24:01.1326282Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:01.1714642Z _openmp_mutex-5.1    | 21 KB     | ########## | 100%
2025-05-07T20:24:01.1901202Z wheel-0.45.1         | 151 KB    | ########## | 100%
2025-05-07T20:24:01.2193818Z _libgcc_mutex-0.1    | 3 KB      | ########## | 100%
2025-05-07T20:24:01.2316767Z ca-certificates-2025 | 129 KB    | ########## | 100%
2025-05-07T20:24:01.3122658Z setuptools-78.1.1    | 2.3 MB    | ########## | 100%
2025-05-07T20:24:01.6151752Z python-3.11.11       | 32.9 MB   | ########## | 100%
2025-05-07T20:24:02.2731535Z done
2025-05-07T20:24:02.4837329Z Preparing transaction: done
2025-05-07T20:24:03.8499766Z Verifying transaction: done
2025-05-07T20:24:06.1652336Z Executing transaction: done
2025-05-07T20:24:06.2161114Z #
2025-05-07T20:24:06.2161767Z # To activate this environment, use
2025-05-07T20:24:06.2163098Z #
2025-05-07T20:24:06.2163528Z #     $ conda activate build_binary
2025-05-07T20:24:06.2164043Z #
2025-05-07T20:24:06.2164459Z # To deactivate an active environment, use
2025-05-07T20:24:06.2164996Z #
2025-05-07T20:24:06.2165368Z #     $ conda deactivate
2025-05-07T20:24:06.3207698Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:06.3229034Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:09.2256834Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (25.1)
2025-05-07T20:24:09.2257463Z Collecting pip
2025-05-07T20:24:09.2257783Z Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:09.2258193Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:09.2259109Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 69.0 MB/s eta 0:00:00
2025-05-07T20:24:09.2259596Z Installing collected packages: pip
2025-05-07T20:24:09.2260016Z Attempting uninstall: pip
2025-05-07T20:24:09.2260419Z Found existing installation: pip 25.1
2025-05-07T20:24:09.2260840Z Uninstalling pip-25.1:
2025-05-07T20:24:09.2261227Z Successfully uninstalled pip-25.1
2025-05-07T20:24:09.2261675Z Successfully installed pip-25.1.1
2025-05-07T20:24:09.2905195Z [SETUP] Upgrading pyOpenSSL ...
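"conda run -n <env>" executes a command inside the environment without activating it, which is why the pip upgrade above lands in the env's own site-packages rather than in base; roughly:

    conda run -n build_binary pip install --upgrade pip
    conda run -n build_binary python -m pip --version   # should report 25.1.1 after the upgrade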
2025-05-07T20:24:09.2927636Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0 2025-05-07T20:24:10.1536035Z Channels: 2025-05-07T20:24:10.1536275Z - conda-forge 2025-05-07T20:24:10.1536554Z Platform: linux-64 2025-05-07T20:24:20.6405885Z Collecting package metadata (repodata.json): - \ | / - \ | / - \ | / - \ | / - \ done 2025-05-07T20:24:22.3183180Z Solving environment: / - \ | / done 2025-05-07T20:24:22.3791687Z 2025-05-07T20:24:22.3792172Z ## Package Plan ## 2025-05-07T20:24:22.3792388Z 2025-05-07T20:24:22.3792600Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:22.3792904Z 2025-05-07T20:24:22.3793016Z added / updated specs: 2025-05-07T20:24:22.3793288Z - pyopenssl[version='>22.1.0'] 2025-05-07T20:24:22.3793513Z 2025-05-07T20:24:22.3793518Z 2025-05-07T20:24:22.3793667Z The following packages will be downloaded: 2025-05-07T20:24:22.3793893Z 2025-05-07T20:24:22.3794014Z package | build 2025-05-07T20:24:22.3794353Z ---------------------------|----------------- 2025-05-07T20:24:22.3794712Z cffi-1.17.1 | py311hf29c0ef_0 295 KB conda-forge 2025-05-07T20:24:22.3795163Z cryptography-44.0.3 | py311hafd3f86_0 1.5 MB conda-forge 2025-05-07T20:24:22.3795612Z libgcc-15.1.0 | h767d61c_2 810 KB conda-forge 2025-05-07T20:24:22.3796022Z libgcc-ng-15.1.0 | h69a702a_2 34 KB conda-forge 2025-05-07T20:24:22.3796452Z libgomp-15.1.0 | h767d61c_2 442 KB conda-forge 2025-05-07T20:24:22.3796868Z openssl-3.5.0 | h7b32b05_1 3.0 MB conda-forge 2025-05-07T20:24:22.3797301Z pycparser-2.22 | pyh29332c3_1 108 KB conda-forge 2025-05-07T20:24:22.3797729Z pyopenssl-25.0.0 | pyhd8ed1ab_0 120 KB conda-forge 2025-05-07T20:24:22.3798157Z python_abi-3.11 | 2_cp311 5 KB conda-forge 2025-05-07T20:24:22.3798620Z typing-extensions-4.13.2 | h0e9735f_0 88 KB conda-forge 2025-05-07T20:24:22.3799104Z typing_extensions-4.13.2 | pyh29332c3_0 51 KB conda-forge 2025-05-07T20:24:22.3799526Z ------------------------------------------------------------ 2025-05-07T20:24:22.3799872Z Total: 6.4 MB 2025-05-07T20:24:22.3800076Z 2025-05-07T20:24:22.3800206Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:22.3800760Z 2025-05-07T20:24:22.3800950Z cffi conda-forge/linux-64::cffi-1.17.1-py311hf29c0ef_0 2025-05-07T20:24:22.3801439Z cryptography conda-forge/linux-64::cryptography-44.0.3-py311hafd3f86_0 2025-05-07T20:24:22.3801933Z libgcc conda-forge/linux-64::libgcc-15.1.0-h767d61c_2 2025-05-07T20:24:22.3802374Z pycparser conda-forge/noarch::pycparser-2.22-pyh29332c3_1 2025-05-07T20:24:22.3802832Z pyopenssl conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0 2025-05-07T20:24:22.3803451Z python_abi conda-forge/linux-64::python_abi-3.11-2_cp311 2025-05-07T20:24:22.3804111Z typing-extensions conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0 2025-05-07T20:24:22.3804682Z typing_extensions conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0 2025-05-07T20:24:22.3805012Z 2025-05-07T20:24:22.3805126Z The following packages will be UPDATED: 2025-05-07T20:24:22.3805329Z 2025-05-07T20:24:22.3805923Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0 2025-05-07T20:24:22.3806689Z libgcc-ng pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2 2025-05-07T20:24:22.3807322Z libgomp pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2 2025-05-07T20:24:22.3808027Z openssl pkgs/main::openssl-3.0.16-h5eee18b_0 --> 
conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:24:22.3808542Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:22.4699174Z libgcc-15.1.0        | 810 KB    | ########## | 100%
2025-05-07T20:24:22.4828529Z cryptography-44.0.3  | 1.5 MB    | ########## | 100%
2025-05-07T20:24:22.5033186Z pyopenssl-25.0.0     | 120 KB    | ########## | 100%
2025-05-07T20:24:22.5091434Z libgomp-15.1.0       | 442 KB    | ########## | 100%
2025-05-07T20:24:22.5337260Z cffi-1.17.1          | 295 KB    | ########## | 100%
2025-05-07T20:24:22.5355361Z pycparser-2.22       | 108 KB    | ########## | 100%
2025-05-07T20:24:22.5874236Z typing_extensions-4. | 51 KB     | ########## | 100%
2025-05-07T20:24:22.5956217Z python_abi-3.11      | 5 KB      | ########## | 100%
2025-05-07T20:24:22.6402248Z libgcc-ng-15.1.0     | 34 KB     | ########## | 100%
2025-05-07T20:24:22.6402667Z openssl-3.5.0        | 3.0 MB    | ########## | 100%
2025-05-07T20:24:22.7108612Z typing-extensions-4. | 88 KB     | ########## | 100%
2025-05-07T20:24:22.8589915Z done
2025-05-07T20:24:22.9593394Z Preparing transaction: done
2025-05-07T20:24:23.0596848Z Verifying transaction: done
2025-05-07T20:24:24.5624681Z Executing transaction: done
2025-05-07T20:24:24.7378003Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:24:26.4590964Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:24:26.4602887Z [SETUP] Installing libxcrypt ...
2025-05-07T20:24:26.4626676Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:24:27.3256676Z Channels:
2025-05-07T20:24:27.3256920Z  - conda-forge
2025-05-07T20:24:27.3257156Z Platform: linux-64
2025-05-07T20:24:30.6406405Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:31.0169683Z Solving environment: done
2025-05-07T20:24:31.0782187Z ## Package Plan ##
2025-05-07T20:24:31.0782555Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:31.0782960Z   added / updated specs:
2025-05-07T20:24:31.0783207Z     - libxcrypt
2025-05-07T20:24:31.0783477Z The following packages will be downloaded:
2025-05-07T20:24:31.0783811Z     package                    |            build
2025-05-07T20:24:31.0784123Z     ---------------------------|-----------------
2025-05-07T20:24:31.0784497Z     libxcrypt-4.4.36           |       hd590300_1          98 KB  conda-forge
2025-05-07T20:24:31.0784891Z     ------------------------------------------------------------
2025-05-07T20:24:31.0785226Z                                            Total:          98 KB
2025-05-07T20:24:31.0785568Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:31.0786003Z   libxcrypt          conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:24:31.0786430Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:31.2196838Z libxcrypt-4.4.36     | 98 KB     | ########## | 100%
2025-05-07T20:24:31.2199670Z done
2025-05-07T20:24:31.3204995Z Preparing transaction: done
2025-05-07T20:24:31.4210069Z Verifying transaction: done
2025-05-07T20:24:31.5215262Z Executing transaction: done
2025-05-07T20:24:34.9374997Z [SETUP] Copying over ...
2025-05-07T20:24:34.9375737Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.11/crypt.h
2025-05-07T20:24:36.5797646Z [SETUP] Installed Python version: Python 3.11.11
2025-05-07T20:24:36.5798129Z [SETUP] Successfully created Conda environment: build_binary
2025-05-07T20:24:36.5832080Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:36.5832554Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:36.5845737Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:24:36.5846087Z env:
2025-05-07T20:24:36.5846317Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:24:36.5846627Z   BUILD_ENV: build_binary
2025-05-07T20:24:36.5846874Z   BUILD_TARGET: genai
2025-05-07T20:24:36.5847109Z   BUILD_VARIANT: cuda
2025-05-07T20:24:36.5847349Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:24:36.5847707Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:24:36.5848001Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:24:36.5848331Z ##[endgroup]
2025-05-07T20:24:36.9204585Z ################################################################################
2025-05-07T20:24:36.9204953Z # Install C/C++ Compilers
2025-05-07T20:24:36.9205217Z #
2025-05-07T20:24:36.9220065Z # [2025-05-07T20:24:36.921Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:24:36.9220474Z ################################################################################
2025-05-07T20:24:36.9235038Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:24:37.0120379Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:24:37.0131133Z [INSTALL] Installing GLIBC (architecture = 64) ...
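The crypt.h copy above is a workaround: the conda Python 3.11 packages no longer ship <crypt.h> alongside the interpreter headers, so source builds that include it through the Python include path would fail; libxcrypt supplies the header, which is then copied next to the Python headers. As a standalone sketch (the motivation is inferred; the commands themselves are from the log):

    conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
    PREFIX="$HOME/miniconda/envs/build_binary"
    cp "$PREFIX/include/crypt.h" "$PREFIX/include/python3.11/crypt.h"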
2025-05-07T20:24:37.0152745Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:24:37.8796838Z Channels:
2025-05-07T20:24:37.8797096Z  - conda-forge
2025-05-07T20:24:37.8797327Z Platform: linux-64
2025-05-07T20:24:41.1992619Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:41.5656558Z Solving environment: done
2025-05-07T20:24:41.6278223Z ## Package Plan ##
2025-05-07T20:24:41.6279126Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:41.6279552Z   added / updated specs:
2025-05-07T20:24:41.6279815Z     - sysroot_linux-64=2.17
2025-05-07T20:24:41.6280105Z The following packages will be downloaded:
2025-05-07T20:24:41.6280438Z     package                       |            build
2025-05-07T20:24:41.6280760Z     ------------------------------|-----------------
2025-05-07T20:24:41.6281223Z     kernel-headers_linux-64-3.10.0|      he073ed8_18         921 KB  conda-forge
2025-05-07T20:24:41.6281761Z     sysroot_linux-64-2.17         |      h0157908_18        14.5 MB  conda-forge
2025-05-07T20:24:41.6282180Z     ------------------------------------------------------------
2025-05-07T20:24:41.6282526Z                                               Total:        15.4 MB
2025-05-07T20:24:41.6282864Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:41.6283387Z   kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:24:41.6283953Z   sysroot_linux-64   conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:24:41.6284418Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:41.9919321Z kernel-headers_linux | 921 KB    | ########## | 100%
2025-05-07T20:24:41.9922175Z sysroot_linux-64-2.1 | 14.5 MB   | ########## | 100%
2025-05-07T20:24:42.6583248Z done
2025-05-07T20:24:42.7589728Z Preparing transaction: done
2025-05-07T20:24:42.9597159Z Verifying transaction: done
2025-05-07T20:24:43.1650395Z Executing transaction: done
2025-05-07T20:24:43.3188228Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:24:43.3188529Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:24:44.9917509Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
2025-05-07T20:24:44.9930455Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
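The [CHECK] lines above verify that the environment ships its own libstdc++.so.6, so binaries compiled with the conda toolchain resolve a matching C++ runtime at load time. One way to inspect that by hand, assuming the env prefix from this log (strings comes from binutils and may need installing separately):

    PREFIX="$HOME/miniconda/envs/build_binary"
    ls -l "$PREFIX/lib/libstdc++.so.6"                  # expect a symlink into the conda-forge libstdcxx package
    strings "$PREFIX/lib/libstdc++.so.6" | grep GLIBCXX_3.4 | tail -n 3   # newest GLIBCXX symbol versions provided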
2025-05-07T20:24:44.9953083Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0 2025-05-07T20:24:45.8858859Z Channels: 2025-05-07T20:24:45.8859104Z - conda-forge 2025-05-07T20:24:45.8859334Z Platform: linux-64 2025-05-07T20:24:49.3022460Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:24:50.2592621Z Solving environment: \ | / done 2025-05-07T20:24:50.3240457Z 2025-05-07T20:24:50.3240637Z ## Package Plan ## 2025-05-07T20:24:50.3240791Z 2025-05-07T20:24:50.3241079Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:50.3241487Z 2025-05-07T20:24:50.3241590Z added / updated specs: 2025-05-07T20:24:50.3241855Z - gxx_linux-64=11.4.0 2025-05-07T20:24:50.3242019Z 2025-05-07T20:24:50.3242032Z 2025-05-07T20:24:50.3242157Z The following packages will be downloaded: 2025-05-07T20:24:50.3242403Z 2025-05-07T20:24:50.3242541Z package | build 2025-05-07T20:24:50.3242862Z ---------------------------|----------------- 2025-05-07T20:24:50.3243263Z binutils_impl_linux-64-2.40| ha1999f0_7 6.0 MB conda-forge 2025-05-07T20:24:50.3243746Z binutils_linux-64-2.40 | hb3c18ed_4 28 KB conda-forge 2025-05-07T20:24:50.3244202Z gcc_impl_linux-64-11.4.0 | h00c12a0_13 53.0 MB conda-forge 2025-05-07T20:24:50.3244646Z gcc_linux-64-11.4.0 | ha077dfb_4 31 KB conda-forge 2025-05-07T20:24:50.3245089Z gxx_impl_linux-64-11.4.0 | h634f3ee_13 11.2 MB conda-forge 2025-05-07T20:24:50.3245528Z gxx_linux-64-11.4.0 | h35bfe5d_4 29 KB conda-forge 2025-05-07T20:24:50.3245955Z ld_impl_linux-64-2.40 | hf3520f5_7 691 KB conda-forge 2025-05-07T20:24:50.3246427Z libgcc-devel_linux-64-11.4.0| h8f596e0_113 2.3 MB conda-forge 2025-05-07T20:24:50.3246910Z libsanitizer-11.4.0 | h5763a12_13 3.5 MB conda-forge 2025-05-07T20:24:50.3247352Z libstdcxx-15.1.0 | h8f9b012_2 3.7 MB conda-forge 2025-05-07T20:24:50.3247926Z libstdcxx-devel_linux-64-11.4.0| h8f596e0_113 11.1 MB conda-forge 2025-05-07T20:24:50.3248414Z libstdcxx-ng-15.1.0 | h4852527_2 34 KB conda-forge 2025-05-07T20:24:50.3248825Z ------------------------------------------------------------ 2025-05-07T20:24:50.3249164Z Total: 91.6 MB 2025-05-07T20:24:50.3249387Z 2025-05-07T20:24:50.3249515Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:50.3249742Z 2025-05-07T20:24:50.3250017Z binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7 2025-05-07T20:24:50.3250574Z binutils_linux-64 conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4 2025-05-07T20:24:50.3251475Z gcc_impl_linux-64 conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13 2025-05-07T20:24:50.3252149Z gcc_linux-64 conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4 2025-05-07T20:24:50.3252652Z gxx_impl_linux-64 conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13 2025-05-07T20:24:50.3253154Z gxx_linux-64 conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4 2025-05-07T20:24:50.3253675Z libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113 2025-05-07T20:24:50.3254232Z libsanitizer conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13 2025-05-07T20:24:50.3254725Z libstdcxx conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2 2025-05-07T20:24:50.3255263Z libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113 2025-05-07T20:24:50.3255630Z 2025-05-07T20:24:50.3255745Z The following packages will be UPDATED: 2025-05-07T20:24:50.3255955Z 2025-05-07T20:24:50.3256272Z ld_impl_linux-64 pkgs/main::ld_impl_linux-64-2.40-h12e~ --> 
conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
2025-05-07T20:24:50.3256994Z   libstdcxx-ng       pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2
2025-05-07T20:24:50.3257557Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:52.8916655Z Preparing transaction: \ done
2025-05-07T20:24:53.1925411Z Verifying transaction: / - \ done
2025-05-07T20:24:53.2935335Z Executing transaction: / done
2025-05-07T20:24:53.4579202Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:24:57.3527595Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:57.3558013Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:24:57.3588505Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:24:57.3618941Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:24:59.2503873Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:59.3123096Z [CHECK] Binary cc found in PATH
2025-05-07T20:25:01.1852487Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:01.2468187Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:03.1209111Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:03.1841424Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:05.0631602Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:05.1258782Z [CHECK] Binary g++ found in PATH
2025-05-07T20:25:05.1263086Z [INFO] Printing out all preprocessor defines in the C compiler ...
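[NOTE] The command below preprocesses an empty translation unit read from stdin: -E stops after preprocessing, and -dM dumps every macro defined at that point, which is what produces the listing that follows. A minimal sketch for spot-checking a single define from this toolchain (the binary path is taken from the [CHECK] lines above; the grep target is just an illustrative choice):

    /home/ec2-user/miniconda/envs/build_binary/bin/cc -dM -E - < /dev/null | grep '__GNUC__'
    # prints: #define __GNUC__ 11    (consistent with the gxx_linux-64=11.4.0 install above)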
2025-05-07T20:25:05.1263747Z + conda run -n build_binary cc -dM -E - 2025-05-07T20:25:05.1263989Z 2025-05-07T20:25:07.0066323Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:07.0066897Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:07.0067499Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:07.0068248Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:07.0068831Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:07.0069421Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:07.0069781Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:07.0070178Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:07.0070540Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:07.0071023Z #define __CHAR_BIT__ 8 2025-05-07T20:25:07.0071430Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:07.0071964Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:07.0072484Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:07.0072953Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:07.0073506Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:07.0074073Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.0074588Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:07.0075147Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:07.0075582Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:07.0076030Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:07.0076537Z #define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:07.0077080Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:07.0077485Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:07.0077851Z #define __GCC_IEC_559 2 2025-05-07T20:25:07.0078228Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:07.0078597Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:07.0078998Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:07.0079358Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:07.0079780Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.0080236Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:07.0080590Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:07.0080962Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:07.0081385Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:07.0081707Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:07.0082072Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:07.0082500Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:07.0082861Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:07.0083171Z #define __INT8_C(c) c 2025-05-07T20:25:07.0083574Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:07.0083962Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.0084341Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:07.0084817Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:07.0085269Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:07.0085605Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:07.0086033Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.0086408Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:07.0086795Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:07.0087322Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:07.0087977Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:07.0088377Z #define __linux 1 2025-05-07T20:25:07.0088739Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:07.0089114Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 
2025-05-07T20:25:07.0089574Z #define __unix 1 2025-05-07T20:25:07.0089924Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:07.0090294Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:07.0090672Z #define __WINT_MIN__ 0U 2025-05-07T20:25:07.0091071Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.0091412Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:07.0091787Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:07.0092206Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:07.0092542Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:07.0092906Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:07.0093371Z #define __INT64_C(c) c ## L 2025-05-07T20:25:07.0093950Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:07.0094404Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:07.0094833Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:07.0095450Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:07.0095994Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:07.0096431Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:07.0096781Z #define __DBL_DIG__ 15 2025-05-07T20:25:07.0097054Z #define __FLT32_DIG__ 6 2025-05-07T20:25:07.0097533Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:07.0097972Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:07.0098263Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:07.0098767Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:07.0099281Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:07.0099688Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:07.0100035Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:07.0100501Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:07.0101048Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:07.0101399Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:07.0101751Z #define __unix__ 1 2025-05-07T20:25:07.0102125Z #define __INT_WIDTH__ 32 2025-05-07T20:25:07.0102446Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:07.0102782Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:07.0103187Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:25:07.0103528Z #define __UINT16_C(c) c 2025-05-07T20:25:07.0103961Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:07.0104340Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:07.0104801Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:07.0105236Z #define __gnu_linux__ 1 2025-05-07T20:25:07.0105987Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:07.0106455Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.0106837Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.0107248Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:07.0107613Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:07.0107944Z #define __GNUC__ 11 2025-05-07T20:25:07.0108288Z #define __pie__ 2 2025-05-07T20:25:07.0108711Z #define __MMX__ 1 2025-05-07T20:25:07.0109030Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:07.0109313Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:07.0109588Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:07.0109860Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:07.0110210Z #define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:07.0110611Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.0110921Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:07.0111188Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:07.0111454Z #define 
__HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:07.0111746Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:07.0112018Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:07.0112282Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:07.0121645Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:07.0121976Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:07.0122266Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:07.0122641Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:07.0122903Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:07.0123172Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:07.0123452Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:07.0123713Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:07.0123976Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:07.0124294Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:07.0124654Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:07.0124926Z #define __SSE2_MATH__ 1 2025-05-07T20:25:07.0125169Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:07.0125471Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.0125768Z #define __amd64 1 2025-05-07T20:25:07.0125989Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:07.0126261Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:07.0126572Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:07.0126877Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:07.0127637Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:25:07.0128080Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:07.0128338Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:07.0128602Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:07.0128874Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:07.0129144Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:07.0129403Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:07.0129684Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:07.0129931Z #define __x86_64 1 2025-05-07T20:25:07.0130158Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:07.0130527Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:07.0130986Z #define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:07.0131432Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:07.0131900Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:07.0132295Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:07.0132558Z #define __LP64__ 1 2025-05-07T20:25:07.0132784Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.0133137Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:07.0133514Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:07.0133786Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:07.0134062Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:07.0134344Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:07.0134613Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:07.0134884Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:07.0135143Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:07.0135401Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:07.0135665Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:07.0135994Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:07.0136382Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:07.0136676Z #define __FLT_DIG__ 6 2025-05-07T20:25:07.0136916Z #define __NO_INLINE__ 1 2025-05-07T20:25:07.0137169Z #define 
__DEC_EVAL_METHOD__ 2 2025-05-07T20:25:07.0137487Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:07.0137835Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:07.0138097Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:07.0138359Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:07.0138617Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:07.0138880Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:07.0139135Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:07.0139431Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:07.0139719Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:07.0139984Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:07.0140286Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:07.0140619Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:07.0140889Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:07.0141145Z #define __FLT128_DIG__ 33 2025-05-07T20:25:07.0141398Z #define __INT32_C(c) c 2025-05-07T20:25:07.0141646Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:07.0141922Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:07.0142202Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:07.0142490Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:25:07.0142803Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:07.0143114Z #define unix 1 2025-05-07T20:25:07.0143349Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:07.0143656Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.0143962Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:07.0144277Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:07.0144599Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:07.0144853Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:07.0145119Z #define __ELF__ 1 2025-05-07T20:25:07.0145344Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:07.0145631Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:07.0146051Z #define __FLT_RADIX__ 2 2025-05-07T20:25:07.0146483Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:07.0146833Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:07.0147195Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:07.0147453Z #define __SSE_MATH__ 1 2025-05-07T20:25:07.0147679Z #define __k8 1 2025-05-07T20:25:07.0147979Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:07.0148354Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:07.0148645Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:07.0148943Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:07.0149205Z #define __LDBL_DIG__ 18 2025-05-07T20:25:07.0149444Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:07.0149705Z #define __x86_64__ 1 2025-05-07T20:25:07.0149943Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:07.0150241Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:07.0150570Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.0150882Z #define __FLT64_DIG__ 15 2025-05-07T20:25:07.0151173Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.0151515Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:07.0151830Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.0152100Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:07.0152371Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.0152673Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:07.0153042Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 
2025-05-07T20:25:07.0153436Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:07.0153735Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:07.0154073Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:07.0154407Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:07.0154706Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:07.0154995Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:07.0155314Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:07.0155601Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:07.0155850Z #define __SEG_FS 1 2025-05-07T20:25:07.0156090Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:07.0156366Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:07.0156654Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.0156950Z #define __SEG_GS 1 2025-05-07T20:25:07.0157263Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:25:07.0157646Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:07.0157931Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:07.0158213Z #define __INT16_TYPE__ short int 2025-05-07T20:25:07.0158497Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:07.0158798Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:07.0159067Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:07.0159314Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:07.0159582Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:07.0159930Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:07.0160320Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.0160615Z #define linux 1 2025-05-07T20:25:07.0160848Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.0161126Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:07.0161405Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:07.0161663Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:07.0161923Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:07.0162190Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:07.0162542Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:07.0162957Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:07.0163283Z #define __code_model_small__ 1 2025-05-07T20:25:07.0163569Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:07.0163863Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:07.0164112Z #define __k8__ 1 2025-05-07T20:25:07.0164351Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:07.0164763Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:07.0165142Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:07.0165390Z #define __pic__ 2 2025-05-07T20:25:07.0165648Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.0165955Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:07.0166259Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.0166598Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:07.0166961Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:07.0167329Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:07.0167759Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:07.0168059Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:07.0168366Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:07.0168627Z #define __linux__ 1 2025-05-07T20:25:07.0168862Z #define __INT64_TYPE__ long int 2025-05-07T20:25:07.0169122Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:07.0169383Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:07.0169664Z 
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:07.0169920Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:07.0170215Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.0170542Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:07.0170833Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:07.0171097Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:07.0171390Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:07.0171683Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:07.0172014Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:07.0172375Z #define __SSE__ 1 2025-05-07T20:25:07.0172601Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:07.0172931Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:07.0173272Z #define __amd64__ 1 2025-05-07T20:25:07.0173495Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:07.0173740Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:07.0174014Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:07.0174288Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:07.0174549Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:07.0174821Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:07.0175085Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:07.0175355Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:07.0175619Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:07.0175965Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:07.0176429Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:07.0176788Z #define _LP64 1 2025-05-07T20:25:07.0177004Z #define __UINT8_C(c) c 2025-05-07T20:25:07.0177238Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:07.0177506Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:07.0177776Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:07.0178046Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:07.0178340Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:07.0178702Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:07.0179173Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:07.0179545Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.0179844Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.0180161Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:07.0180522Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:07.0180893Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:07.0181163Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:07.0181500Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:07.0181856Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:07.0182116Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:07.0182363Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:07.0182606Z #define __FXSR__ 1 2025-05-07T20:25:07.0182905Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:07.0185085Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:07.0185591Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:07.0185893Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:07.0186153Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:07.0186478Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:07.0186833Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:07.0187078Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:07.0187317Z #define __PIC__ 2 2025-05-07T20:25:07.0187565Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:07.0187965Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:07.0188353Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:07.0188684Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:07.0189015Z #define __SSE2__ 1 2025-05-07T20:25:07.0189244Z #define __INT32_TYPE__ int 2025-05-07T20:25:07.0189495Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:07.0189763Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:07.0190098Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:07.0190454Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:07.0190728Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:07.0191001Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:07.0191271Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.0191542Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:07.0191791Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:07.0192039Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:07.0192323Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.0192618Z #define __PIE__ 2 2025-05-07T20:25:07.0192940Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:07.0193325Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:07.0193667Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:07.0194036Z #define __INT16_C(c) c 2025-05-07T20:25:07.0194263Z #define __STDC__ 1 2025-05-07T20:25:07.0194495Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:07.0194767Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:07.0195018Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:07.0195321Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:07.0195667Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:07.0196000Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:07.0196261Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.0196541Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:07.0196817Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:07.0197092Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:07.0197384Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.0197656Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:07.0197946Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.0198339Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:07.0198716Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:07.0199018Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:07.0199315Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:07.0199569Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:07.0199728Z 2025-05-07T20:25:07.0698778Z 2025-05-07T20:25:07.0699371Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
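[NOTE] For the C++ dump below, the extra -x c++ flag matters: stdin has no file extension, so the driver must be told to treat the input as C++ rather than C (which is why the C++-only __cpp_* feature-test macros appear in this second listing but not the first). A minimal sketch for confirming the default language standard (path again taken from the [CHECK] lines above):

    /home/ec2-user/miniconda/envs/build_binary/bin/c++ -dM -E -x c++ - < /dev/null | grep '__cplusplus'
    # prints: #define __cplusplus 201703L    (gcc 11 defaults to -std=gnu++17)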
2025-05-07T20:25:07.0699811Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:07.0700048Z 2025-05-07T20:25:08.9606024Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:08.9606464Z #define __cpp_attributes 200809L 2025-05-07T20:25:08.9606931Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:08.9607376Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:08.9607767Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:08.9608031Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:08.9608361Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:08.9611212Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:08.9611782Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:08.9612214Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:08.9612620Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:08.9612950Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:08.9613198Z #define __CHAR_BIT__ 8 2025-05-07T20:25:08.9613422Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:08.9613668Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:08.9613913Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:08.9614171Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:08.9614441Z #define __cpp_static_assert 201411L 2025-05-07T20:25:08.9614727Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:08.9615019Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.9615316Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:08.9615609Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:08.9615930Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:08.9616267Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:08.9616678Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:08.9617088Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:08.9617392Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:08.9617671Z #define __GCC_IEC_559 2 2025-05-07T20:25:08.9617920Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:08.9618194Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:08.9618472Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:08.9618764Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:08.9619056Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:08.9619384Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:08.9619692Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:08.9620015Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.9620339Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:08.9620614Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:08.9620897Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:08.9621176Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:08.9621472Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:08.9621738Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:08.9621993Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:08.9622266Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:08.9622594Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:08.9622917Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:08.9623170Z #define __INT8_C(c) c 2025-05-07T20:25:08.9623409Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:08.9623672Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:08.9623992Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.9624316Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:08.9624587Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:08.9624875Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:08.9625198Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:08.9625557Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:08.9625832Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:08.9626112Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:08.9626375Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:08.9626645Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:08.9626923Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:08.9627310Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:08.9627719Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:08.9628006Z #define __linux 1 2025-05-07T20:25:08.9628233Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:08.9628503Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:08.9628783Z #define __unix 1 2025-05-07T20:25:08.9629010Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:08.9629298Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:08.9629578Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:08.9629972Z #define __WINT_MIN__ 0U 2025-05-07T20:25:08.9630296Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:08.9630573Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:08.9630848Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:08.9631114Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:08.9631433Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:08.9631718Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:08.9632020Z #define __INT64_C(c) c ## L 2025-05-07T20:25:08.9632284Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:08.9632584Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:08.9632863Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:08.9633172Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:08.9633445Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:08.9633713Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:08.9634067Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:08.9634439Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:08.9634702Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:08.9634992Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:08.9635262Z #define __DBL_DIG__ 15 2025-05-07T20:25:08.9635516Z #define __FLT32_DIG__ 6 2025-05-07T20:25:08.9635817Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:08.9636165Z #define __GXX_WEAK__ 1 2025-05-07T20:25:08.9636404Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:08.9636657Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:08.9637037Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:08.9637387Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:08.9637655Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:08.9637954Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:08.9638285Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:08.9638697Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:08.9639097Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:08.9639387Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:08.9639652Z #define __unix__ 1 2025-05-07T20:25:08.9639873Z #define __INT_WIDTH__ 32 2025-05-07T20:25:08.9640123Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:08.9640371Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:08.9649397Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:08.9649723Z #define __UINT16_C(c) c 2025-05-07T20:25:08.9649987Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:08.9650263Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:08.9650633Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:08.9651016Z #define __gnu_linux__ 1 2025-05-07T20:25:08.9651277Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:08.9651548Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:08.9651845Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:08.9652145Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:08.9652420Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:08.9652696Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:08.9652963Z #define __GNUC__ 11 2025-05-07T20:25:08.9653193Z #define __GXX_RTTI 1 2025-05-07T20:25:08.9653425Z #define __pie__ 2 2025-05-07T20:25:08.9653638Z #define __MMX__ 1 2025-05-07T20:25:08.9653867Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:08.9654143Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:08.9654422Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:08.9654696Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:08.9654954Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:08.9655253Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:08.9655576Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:08.9655926Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:08.9656301Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:08.9656611Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.9656933Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:08.9657203Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:08.9657676Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:08.9658094Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:08.9658396Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:08.9658664Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:08.9658932Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:08.9659225Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:08.9659520Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:08.9659793Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:08.9660083Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:08.9660337Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:08.9660608Z #define __cplusplus 201703L 2025-05-07T20:25:08.9660883Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:08.9661163Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:08.9661421Z #define __DEPRECATED 1 2025-05-07T20:25:08.9661680Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:08.9661975Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:08.9662226Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:08.9662549Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:08.9662914Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:08.9663177Z #define __SSE2_MATH__ 1 2025-05-07T20:25:08.9663428Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:08.9663729Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.9664012Z #define __amd64 1 2025-05-07T20:25:08.9664237Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:08.9664505Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:08.9664763Z #define __GNUG__ 11 2025-05-07T20:25:08.9665017Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:08.9665332Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:08.9665580Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:08.9665837Z #define __FLT64X_MIN_EXP__ (-16381) 
2025-05-07T20:25:08.9671749Z #define __LP64__ 1
2025-05-07T20:25:08.9677239Z #define __VERSION__ "11.4.0"
2025-05-07T20:25:08.9689804Z #define __x86_64__ 1
2025-05-07T20:25:08.9711531Z #define __linux__ 1
2025-05-07T20:25:08.9727700Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__
2025-05-07T20:25:08.9747062Z [... several hundred further predefined macros elided; together they confirm GCC 11.4.0 targeting x86-64 Linux (ELF, little-endian, LP64), consistent with the default language standards probed below ...]
2025-05-07T20:25:09.0238120Z + conda run -n build_binary c++ --version
2025-05-07T20:25:10.8977669Z c++ (conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:25:10.8978052Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:25:10.8978499Z This is free software; see the source for copying conditions. There is NO
2025-05-07T20:25:10.8979042Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:25:10.9623666Z [INFO] Printing the default version of the C standard used by the compiler ...
2025-05-07T20:25:10.9624205Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__
2025-05-07T20:25:12.9069149Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:12.9071923Z [INFO] Printing the default version of the C++ standard used by the compiler ...
2025-05-07T20:25:12.9073828Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus
2025-05-07T20:25:14.8470694Z #define __cplusplus 201703L
2025-05-07T20:25:14.8475227Z [INSTALL] Successfully installed C/C++ compilers
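[NOTE] The two standard-version probes above generalize to any toolchain: dumping the preprocessor's predefined macros and filtering for the language-standard macros is a quick way to learn which C/C++ standard a compiler targets by default. A minimal sketch, assuming only the conda env "build_binary" with cc and c++ installed, as in this job:

    # Dump every predefined macro (the long listing printed earlier in this step):
    conda run -n build_binary c++ -dM -E -x c++ - < /dev/null

    # Keep just the language-standard macros:
    conda run -n build_binary cc  -dM -E -        < /dev/null | grep __STDC_VERSION__   # 201710L -> C17
    conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus        # 201703L -> C++17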
2025-05-07T20:25:14.8520661Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.6.3
2025-05-07T20:25:14.8521085Z . $PRELUDE; install_cuda $BUILD_ENV 12.6.3
2025-05-07T20:25:14.8532626Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:14.8532974Z env:
2025-05-07T20:25:14.8533201Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:14.8533497Z   BUILD_ENV: build_binary
2025-05-07T20:25:14.8533744Z   BUILD_TARGET: genai
2025-05-07T20:25:14.8533973Z   BUILD_VARIANT: cuda
2025-05-07T20:25:14.8534200Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:25:14.8534458Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:14.8534758Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:14.8535080Z ##[endgroup]
2025-05-07T20:25:15.1867272Z ################################################################################
2025-05-07T20:25:15.1867650Z # Install CUDA
2025-05-07T20:25:15.1867864Z #
2025-05-07T20:25:15.1883337Z # [2025-05-07T20:25:15.188Z] + install_cuda build_binary 12.6.3
2025-05-07T20:25:15.1883721Z ################################################################################
2025-05-07T20:25:15.1898736Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:15.2818769Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:15.2819646Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:25:15.2823710Z + conda clean --packages --tarball -y
2025-05-07T20:25:15.9898505Z Will remove 32 (148.9 MB) tarball(s).
2025-05-07T20:25:15.9898875Z Will remove 6 (619 KB) package(s).
2025-05-07T20:25:16.0533732Z + conda clean --all -y
2025-05-07T20:25:16.7201784Z There are no unused tarball(s) to remove.
2025-05-07T20:25:16.7202179Z Will remove 1 index cache(s).
2025-05-07T20:25:16.7202475Z There are no unused package(s) to remove.
2025-05-07T20:25:16.7202793Z There are no tempfile(s) to remove.
2025-05-07T20:25:16.7203096Z There are no logfile(s) to remove.
2025-05-07T20:25:16.7844856Z [INSTALL] Installing CUDA 12.6.3 ...
2025-05-07T20:25:16.7868119Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.6.3
2025-05-07T20:25:17.7204886Z Channels:
2025-05-07T20:25:17.7205135Z  - conda-forge
2025-05-07T20:25:17.7205368Z Platform: linux-64
2025-05-07T20:25:28.2989504Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:29.4088329Z Solving environment: done
2025-05-07T20:25:29.4833400Z ## Package Plan ##
2025-05-07T20:25:29.4833812Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:29.4834253Z   added / updated specs:
2025-05-07T20:25:29.4834502Z     - cuda=12.6.3
2025-05-07T20:25:29.4834778Z The following packages will be downloaded:
2025-05-07T20:25:29.4835119Z     package                        |            build
2025-05-07T20:25:29.4835435Z     -------------------------------|-----------------
2025-05-07T20:25:29.4837444Z     cuda-12.6.3                    |        ha804496_0          26 KB  conda-forge
2025-05-07T20:25:29.4848136Z     cuda-nsight-12.6.77            |        h7938cbb_0       113.2 MB  conda-forge
2025-05-07T20:25:29.4849033Z     cuda-nvcc-dev_linux-64-12.6.85 |        he91c749_0        10.8 MB  conda-forge
2025-05-07T20:25:29.4849958Z     cuda-nvcc-tools-12.6.85        |        he02047a_0        23.0 MB  conda-forge
2025-05-07T20:25:29.4850881Z     cuda-nvdisasm-12.6.77          |        hbd13f7d_1        47.6 MB  conda-forge
2025-05-07T20:25:29.4852678Z     cuda-nvrtc-12.6.85             |        hbd13f7d_0        17.3 MB  conda-forge
2025-05-07T20:25:29.4854492Z     cuda-nvvm-impl-12.6.85         |        he02047a_0         7.7 MB  conda-forge
2025-05-07T20:25:29.4854950Z     cuda-nvvm-tools-12.6.85        |        he02047a_0        10.4 MB  conda-forge
2025-05-07T20:25:29.4855394Z     cuda-nvvp-12.6.80              |        hbd13f7d_1       109.3 MB  conda-forge
2025-05-07T20:25:29.4858110Z     cuda-sanitizer-api-12.6.77     |        hbd13f7d_1         8.9 MB  conda-forge
2025-05-07T20:25:29.4865733Z     gds-tools-1.11.1.6             |        h5888daf_4        37.8 MB  conda-forge
2025-05-07T20:25:29.4868118Z     libcublas-12.6.4.1             |        h5888daf_1       256.2 MB  conda-forge
2025-05-07T20:25:29.4868999Z     libcufft-11.3.0.4              |        hbd13f7d_0       156.2 MB  conda-forge
2025-05-07T20:25:29.4870776Z     libcurand-10.3.7.77            |        hbd13f7d_0        39.9 MB  conda-forge
2025-05-07T20:25:29.4871685Z     libcusolver-11.7.1.2           |        h5888daf_1        95.8 MB  conda-forge
2025-05-07T20:25:29.4872604Z     libcusparse-12.5.4.2           |        hbd13f7d_0       118.6 MB  conda-forge
2025-05-07T20:25:29.4877538Z     libnpp-12.3.1.54               |        h5888daf_0        93.4 MB  conda-forge
2025-05-07T20:25:29.4880137Z     libnvjitlink-12.6.85           |        hbd13f7d_0        14.9 MB  conda-forge
2025-05-07T20:25:29.4886653Z     nsight-compute-2024.3.2.3      |        hb5ebaad_0       443.1 MB  conda-forge
2025-05-07T20:25:29.4889711Z     python-3.11.8                  |hab00c5b_0_cpython        29.3 MB  conda-forge
2025-05-07T20:25:29.4901461Z     [... ~90 smaller packages elided: remaining CUDA 12.6 components and their -dev counterparts, the conda-forge gcc/gxx/binutils 11.4.0 toolchain, and X11/font/system support libraries ...]
2025-05-07T20:25:29.4901842Z     ------------------------------------------------------------
2025-05-07T20:25:29.4902193Z                                            Total:        1.64 GB
2025-05-07T20:25:29.4902541Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:29.4902973Z   [... ~115 entries elided; they mirror the download list above: the cuda/cuda-toolkit/cuda-runtime/cuda-compiler meta-packages and their 12.6 components, the CUDA math libraries with -dev packages, the gcc/gxx 11.4.0 compilers, and supporting system libraries, all from conda-forge ...]
2025-05-07T20:25:29.4987500Z The following packages will be UPDATED:
2025-05-07T20:25:29.4987984Z   libuuid            pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0
2025-05-07T20:25:29.4988569Z   zlib               pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2
2025-05-07T20:25:29.4989116Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:25:29.4989715Z   python             pkgs/main::python-3.11.11-he870216_0 --> conda-forge::python-3.11.8-hab00c5b_0_cpython
2025-05-07T20:25:29.4990341Z   sqlite             pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1
2025-05-07T20:25:29.4990915Z   tk                 pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101
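[NOTE] The "[EXEC] [ATTEMPT 0/3]" prefix on the wget and conda install commands above shows that the prelude script retries network-bound steps. The workflow's real implementation lives in .github/scripts/setup_env.bash and is not shown in this log; the following is only a minimal bash sketch of such a wrapper, with a hypothetical helper name:

    # Hypothetical retry helper; mirrors the "[EXEC] [ATTEMPT n/3]" log prefix.
    exec_with_retries () {
      local max=3 attempt
      for attempt in 0 1 2; do
        echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
        "$@" && return 0          # stop on first success
        sleep $((2 ** attempt))   # back off 1s, 2s, 4s between attempts
      done
      return 1                    # all attempts failed
    }

    # Example: the CUDA install command from this step.
    exec_with_retries conda install --force-reinstall -n build_binary \
        -c conda-forge --override-channels -y cuda=12.6.3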
2025-05-07T20:25:29.4991417Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:29.4991793Z [... interleaved download progress bars elided: nsight-compute (443.1 MB), libcublas (256.2 MB), libcufft (156.2 MB), libcusparse (118.6 MB), cuda-nsight (113.2 MB), cuda-nvvp (109.3 MB), libcusolver (95.8 MB), libnpp (93.4 MB), and the remaining packages advance in parallel from 0%; the excerpt ends mid-download at 2025-05-07T20:25:32Z ...]
| 443.1 MB | ## | 20% 2025-05-07T20:25:32.1673391Z 2025-05-07T20:25:32.1673396Z 2025-05-07T20:25:32.1673399Z 2025-05-07T20:25:32.1675763Z 2025-05-07T20:25:32.2138614Z cuda-nsight-12.6.77 | 113.2 MB | ######## | 81%  2025-05-07T20:25:32.2139723Z 2025-05-07T20:25:32.2482985Z libcublas-12.6.4.1 | 256.2 MB | ###5 | 36%  2025-05-07T20:25:32.2483280Z 2025-05-07T20:25:32.2485503Z 2025-05-07T20:25:32.2638489Z libcufft-11.3.0.4 | 156.2 MB | #####2 | 52%  2025-05-07T20:25:32.2638760Z 2025-05-07T20:25:32.2638764Z 2025-05-07T20:25:32.2639073Z 2025-05-07T20:25:32.2673682Z libcusparse-12.5.4.2 | 118.6 MB | ########5 | 85%  2025-05-07T20:25:32.2674386Z 2025-05-07T20:25:32.2674422Z 2025-05-07T20:25:32.2674426Z 2025-05-07T20:25:32.2674430Z 2025-05-07T20:25:32.2709420Z cuda-nsight-12.6.77 | 113.2 MB | ########3 | 84%  2025-05-07T20:25:32.3140608Z nsight-compute-2024. | 443.1 MB | ##1 | 21% 2025-05-07T20:25:32.3140908Z 2025-05-07T20:25:32.3592999Z libcublas-12.6.4.1 | 256.2 MB | ###7 | 37%  2025-05-07T20:25:32.3593366Z 2025-05-07T20:25:32.3597155Z 2025-05-07T20:25:32.3638402Z libcufft-11.3.0.4 | 156.2 MB | #####4 | 54%  2025-05-07T20:25:32.3638679Z 2025-05-07T20:25:32.3638683Z 2025-05-07T20:25:32.3638687Z 2025-05-07T20:25:32.3676714Z libcusparse-12.5.4.2 | 118.6 MB | ########8 | 89%  2025-05-07T20:25:32.3677138Z 2025-05-07T20:25:32.3677144Z 2025-05-07T20:25:32.3677149Z 2025-05-07T20:25:32.3680397Z 2025-05-07T20:25:32.3761802Z cuda-nsight-12.6.77 | 113.2 MB | ########7 | 87%  2025-05-07T20:25:32.4160831Z nsight-compute-2024. | 443.1 MB | ##1 | 22% 2025-05-07T20:25:32.4162007Z 2025-05-07T20:25:32.4593684Z libcublas-12.6.4.1 | 256.2 MB | ###8 | 39%  2025-05-07T20:25:32.4594151Z 2025-05-07T20:25:32.4596676Z 2025-05-07T20:25:32.4696519Z libcufft-11.3.0.4 | 156.2 MB | #####6 | 56%  2025-05-07T20:25:32.4696927Z 2025-05-07T20:25:32.4696933Z 2025-05-07T20:25:32.4699648Z 2025-05-07T20:25:32.4702603Z libcusparse-12.5.4.2 | 118.6 MB | #########2 | 92%  2025-05-07T20:25:32.4702977Z 2025-05-07T20:25:32.4702982Z 2025-05-07T20:25:32.4702986Z 2025-05-07T20:25:32.4702989Z 2025-05-07T20:25:32.4780832Z cuda-nsight-12.6.77 | 113.2 MB | ######### | 90%  2025-05-07T20:25:32.5167478Z nsight-compute-2024. | 443.1 MB | ##2 | 23% 2025-05-07T20:25:32.5168679Z 2025-05-07T20:25:32.5703086Z libcublas-12.6.4.1 | 256.2 MB | #### | 40%  2025-05-07T20:25:32.5703360Z 2025-05-07T20:25:32.5703364Z 2025-05-07T20:25:32.5703831Z 2025-05-07T20:25:32.5706115Z libcusparse-12.5.4.2 | 118.6 MB | #########5 | 95%  2025-05-07T20:25:32.5706538Z 2025-05-07T20:25:32.5706544Z 2025-05-07T20:25:32.5706550Z 2025-05-07T20:25:32.5707369Z 2025-05-07T20:25:32.5782683Z cuda-nsight-12.6.77 | 113.2 MB | #########3 | 94%  2025-05-07T20:25:32.5975010Z nsight-compute-2024. | 443.1 MB | ##3 | 23% 2025-05-07T20:25:32.5975282Z 2025-05-07T20:25:32.5977782Z 2025-05-07T20:25:32.6185977Z libcufft-11.3.0.4 | 156.2 MB | #####8 | 58%  2025-05-07T20:25:32.6187186Z 2025-05-07T20:25:32.6708174Z libcublas-12.6.4.1 | 256.2 MB | ####1 | 42%  2025-05-07T20:25:32.6708509Z 2025-05-07T20:25:32.6708513Z 2025-05-07T20:25:32.6708517Z 2025-05-07T20:25:32.6709506Z libcusparse-12.5.4.2 | 118.6 MB | #########8 | 99%  2025-05-07T20:25:32.6709873Z 2025-05-07T20:25:32.6709879Z 2025-05-07T20:25:32.6709882Z 2025-05-07T20:25:32.6712254Z 2025-05-07T20:25:32.6864121Z cuda-nsight-12.6.77 | 113.2 MB | #########7 | 97%  2025-05-07T20:25:32.7186765Z nsight-compute-2024. 
| 443.1 MB | ##4 | 24% 2025-05-07T20:25:32.7187508Z 2025-05-07T20:25:32.7763261Z libcublas-12.6.4.1 | 256.2 MB | ####3 | 43%  2025-05-07T20:25:32.7763639Z 2025-05-07T20:25:32.7763643Z 2025-05-07T20:25:32.8187916Z libcufft-11.3.0.4 | 156.2 MB | ###### | 60%  2025-05-07T20:25:32.8188482Z 2025-05-07T20:25:32.8534738Z libcublas-12.6.4.1 | 256.2 MB | ####6 | 47%  2025-05-07T20:25:32.8764098Z nsight-compute-2024. | 443.1 MB | ##4 | 25% 2025-05-07T20:25:32.8764496Z 2025-05-07T20:25:32.8765355Z 2025-05-07T20:25:32.9499645Z libcufft-11.3.0.4 | 156.2 MB | ######2 | 62%  2025-05-07T20:25:32.9500038Z 2025-05-07T20:25:32.9535233Z libcublas-12.6.4.1 | 256.2 MB | ####8 | 49%  2025-05-07T20:25:32.9770321Z nsight-compute-2024. | 443.1 MB | ##5 | 26% 2025-05-07T20:25:32.9770644Z 2025-05-07T20:25:32.9772772Z 2025-05-07T20:25:33.0538367Z libcufft-11.3.0.4 | 156.2 MB | ######4 | 65%  2025-05-07T20:25:33.0771581Z nsight-compute-2024. | 443.1 MB | ##6 | 27% 2025-05-07T20:25:33.0772003Z 2025-05-07T20:25:33.0772023Z 2025-05-07T20:25:33.0826244Z libcufft-11.3.0.4 | 156.2 MB | ######7 | 67%  2025-05-07T20:25:33.0829408Z 2025-05-07T20:25:33.1541992Z libcublas-12.6.4.1 | 256.2 MB | ##### | 50%  2025-05-07T20:25:33.1896758Z nsight-compute-2024. | 443.1 MB | ##7 | 27% 2025-05-07T20:25:33.1897123Z 2025-05-07T20:25:33.1900385Z 2025-05-07T20:25:33.1990253Z libcufft-11.3.0.4 | 156.2 MB | ######9 | 70%  2025-05-07T20:25:33.1991079Z 2025-05-07T20:25:33.2542792Z libcublas-12.6.4.1 | 256.2 MB | #####2 | 52%  2025-05-07T20:25:33.2991578Z nsight-compute-2024. | 443.1 MB | ##8 | 29% 2025-05-07T20:25:33.2991979Z 2025-05-07T20:25:33.3545234Z libcublas-12.6.4.1 | 256.2 MB | #####4 | 55%  2025-05-07T20:25:33.3992382Z nsight-compute-2024. | 443.1 MB | ##9 | 30% 2025-05-07T20:25:33.3992777Z 2025-05-07T20:25:33.4003336Z libcublas-12.6.4.1 | 256.2 MB | #####7 | 57%  2025-05-07T20:25:33.4003662Z 2025-05-07T20:25:33.4005448Z 2025-05-07T20:25:33.4546168Z libcufft-11.3.0.4 | 156.2 MB | #######1 | 72%  2025-05-07T20:25:33.5005381Z nsight-compute-2024. | 443.1 MB | ###1 | 31% 2025-05-07T20:25:33.5005833Z 2025-05-07T20:25:33.5005839Z 2025-05-07T20:25:33.5110816Z libcufft-11.3.0.4 | 156.2 MB | #######4 | 75%  2025-05-07T20:25:33.5111710Z 2025-05-07T20:25:33.5546435Z libcublas-12.6.4.1 | 256.2 MB | #####9 | 59%  2025-05-07T20:25:33.6045754Z nsight-compute-2024. | 443.1 MB | ###2 | 32% 2025-05-07T20:25:33.6046026Z 2025-05-07T20:25:33.6046031Z 2025-05-07T20:25:33.6112120Z libcufft-11.3.0.4 | 156.2 MB | #######6 | 77%  2025-05-07T20:25:33.6112394Z 2025-05-07T20:25:33.6548618Z libcublas-12.6.4.1 | 256.2 MB | ######1 | 61%  2025-05-07T20:25:33.7047150Z nsight-compute-2024. | 443.1 MB | ###3 | 34% 2025-05-07T20:25:33.7047433Z 2025-05-07T20:25:33.7047997Z 2025-05-07T20:25:33.7183591Z libcufft-11.3.0.4 | 156.2 MB | #######9 | 80%  2025-05-07T20:25:33.7185669Z 2025-05-07T20:25:33.7665810Z libcublas-12.6.4.1 | 256.2 MB | ######3 | 63%  2025-05-07T20:25:33.8049431Z nsight-compute-2024. | 443.1 MB | ###4 | 35% 2025-05-07T20:25:33.8049828Z 2025-05-07T20:25:33.8051356Z 2025-05-07T20:25:33.8184593Z libcufft-11.3.0.4 | 156.2 MB | ########2 | 82%  2025-05-07T20:25:33.8186052Z 2025-05-07T20:25:33.8667454Z libcublas-12.6.4.1 | 256.2 MB | ######5 | 65%  2025-05-07T20:25:33.9196765Z nsight-compute-2024. 
| 443.1 MB | ###5 | 36% 2025-05-07T20:25:33.9197505Z 2025-05-07T20:25:33.9290761Z libcublas-12.6.4.1 | 256.2 MB | ######7 | 67%  2025-05-07T20:25:33.9291031Z 2025-05-07T20:25:33.9291251Z 2025-05-07T20:25:33.9668141Z libcufft-11.3.0.4 | 156.2 MB | ########4 | 85%  2025-05-07T20:25:34.0254872Z nsight-compute-2024. | 443.1 MB | ###7 | 37% 2025-05-07T20:25:34.0255592Z 2025-05-07T20:25:34.0294500Z libcublas-12.6.4.1 | 256.2 MB | ######9 | 69%  2025-05-07T20:25:34.0294873Z 2025-05-07T20:25:34.0295669Z 2025-05-07T20:25:34.0691563Z libcufft-11.3.0.4 | 156.2 MB | ########7 | 88%  2025-05-07T20:25:34.1258352Z nsight-compute-2024. | 443.1 MB | ###8 | 38% 2025-05-07T20:25:34.1261038Z 2025-05-07T20:25:34.1418122Z libcublas-12.6.4.1 | 256.2 MB | #######1 | 71%  2025-05-07T20:25:34.1418430Z 2025-05-07T20:25:34.1418434Z 2025-05-07T20:25:34.1697753Z libcufft-11.3.0.4 | 156.2 MB | ######### | 90%  2025-05-07T20:25:34.2361938Z nsight-compute-2024. | 443.1 MB | ###9 | 39% 2025-05-07T20:25:34.2362317Z 2025-05-07T20:25:34.2418960Z libcublas-12.6.4.1 | 256.2 MB | #######3 | 73%  2025-05-07T20:25:34.2419350Z 2025-05-07T20:25:34.2419357Z 2025-05-07T20:25:34.2784731Z libcufft-11.3.0.4 | 156.2 MB | #########3 | 93%  2025-05-07T20:25:34.3367026Z nsight-compute-2024. | 443.1 MB | #### | 41% 2025-05-07T20:25:34.3367811Z 2025-05-07T20:25:34.3419797Z libcublas-12.6.4.1 | 256.2 MB | #######5 | 75%  2025-05-07T20:25:34.3420207Z 2025-05-07T20:25:34.3420213Z 2025-05-07T20:25:34.3853600Z libcufft-11.3.0.4 | 156.2 MB | #########5 | 96%  2025-05-07T20:25:34.4422640Z nsight-compute-2024. | 443.1 MB | ####1 | 42% 2025-05-07T20:25:34.4422927Z 2025-05-07T20:25:34.4422932Z 2025-05-07T20:25:34.4498081Z libcufft-11.3.0.4 | 156.2 MB | #########8 | 99%  2025-05-07T20:25:34.4498461Z 2025-05-07T20:25:34.4869214Z libcublas-12.6.4.1 | 256.2 MB | #######7 | 77%  2025-05-07T20:25:34.5499428Z nsight-compute-2024. | 443.1 MB | ####2 | 43% 2025-05-07T20:25:34.5503157Z 2025-05-07T20:25:34.5869308Z libcublas-12.6.4.1 | 256.2 MB | #######9 | 79%  2025-05-07T20:25:34.6351758Z nsight-compute-2024. | 443.1 MB | ####4 | 44% 2025-05-07T20:25:34.6352028Z 2025-05-07T20:25:34.6352033Z 2025-05-07T20:25:34.6352037Z 2025-05-07T20:25:34.6352041Z 2025-05-07T20:25:34.6500949Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:34.6503556Z 2025-05-07T20:25:34.6792881Z libcublas-12.6.4.1 | 256.2 MB | ########1 | 81%  2025-05-07T20:25:34.6793227Z 2025-05-07T20:25:34.6793233Z 2025-05-07T20:25:34.6793259Z 2025-05-07T20:25:34.6793265Z 2025-05-07T20:25:34.6794736Z 2025-05-07T20:25:34.6889358Z cuda-nvvp-12.6.80 | 109.3 MB | | 0%  2025-05-07T20:25:34.6889641Z 2025-05-07T20:25:34.6889645Z 2025-05-07T20:25:34.6889648Z 2025-05-07T20:25:34.7490499Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:25:34.7490810Z 2025-05-07T20:25:34.7490814Z 2025-05-07T20:25:34.7490818Z 2025-05-07T20:25:34.7490822Z 2025-05-07T20:25:34.7490826Z 2025-05-07T20:25:34.7492804Z 2025-05-07T20:25:34.7786484Z libcusolver-11.7.1.2 | 95.8 MB | | 0%  2025-05-07T20:25:34.7786790Z 2025-05-07T20:25:34.7799365Z libcublas-12.6.4.1 | 256.2 MB | ########3 | 83%  2025-05-07T20:25:34.7799631Z 2025-05-07T20:25:34.7799659Z 2025-05-07T20:25:34.7799665Z 2025-05-07T20:25:34.7799668Z 2025-05-07T20:25:34.7801551Z 2025-05-07T20:25:34.8045848Z cuda-nvvp-12.6.80 | 109.3 MB | 3 | 3%  2025-05-07T20:25:34.8496320Z nsight-compute-2024. 
| 443.1 MB | ####5 | 45% 2025-05-07T20:25:34.8496645Z 2025-05-07T20:25:34.8496652Z 2025-05-07T20:25:34.8496658Z 2025-05-07T20:25:34.8496663Z 2025-05-07T20:25:34.8496668Z 2025-05-07T20:25:34.8496673Z 2025-05-07T20:25:34.8804689Z libcusolver-11.7.1.2 | 95.8 MB | 2 | 3%  2025-05-07T20:25:34.8805094Z 2025-05-07T20:25:34.8805102Z 2025-05-07T20:25:34.8805107Z 2025-05-07T20:25:34.8805112Z 2025-05-07T20:25:34.8810935Z 2025-05-07T20:25:34.9365368Z cuda-nvvp-12.6.80 | 109.3 MB | 5 | 6%  2025-05-07T20:25:34.9370335Z 2025-05-07T20:25:34.9500545Z libcublas-12.6.4.1 | 256.2 MB | ########4 | 85%  2025-05-07T20:25:34.9500913Z 2025-05-07T20:25:34.9500918Z 2025-05-07T20:25:34.9501185Z 2025-05-07T20:25:34.9501189Z 2025-05-07T20:25:34.9501193Z 2025-05-07T20:25:34.9501198Z 2025-05-07T20:25:34.9682741Z libcusolver-11.7.1.2 | 95.8 MB | 5 | 6%  2025-05-07T20:25:34.9809453Z nsight-compute-2024. | 443.1 MB | ####6 | 46% 2025-05-07T20:25:34.9809838Z 2025-05-07T20:25:34.9809843Z 2025-05-07T20:25:34.9809846Z 2025-05-07T20:25:34.9809859Z 2025-05-07T20:25:34.9812171Z 2025-05-07T20:25:35.0508088Z cuda-nvvp-12.6.80 | 109.3 MB | 8 | 8%  2025-05-07T20:25:35.0508389Z 2025-05-07T20:25:35.0508393Z 2025-05-07T20:25:35.0508397Z 2025-05-07T20:25:35.0508409Z 2025-05-07T20:25:35.0508413Z 2025-05-07T20:25:35.0510233Z 2025-05-07T20:25:35.0810509Z libcusolver-11.7.1.2 | 95.8 MB | 8 | 8%  2025-05-07T20:25:35.0810826Z 2025-05-07T20:25:35.0810838Z 2025-05-07T20:25:35.0810842Z 2025-05-07T20:25:35.0810846Z 2025-05-07T20:25:35.0810850Z 2025-05-07T20:25:35.0872025Z cuda-nvvp-12.6.80 | 109.3 MB | # | 11%  2025-05-07T20:25:35.0873164Z 2025-05-07T20:25:35.1048215Z libcublas-12.6.4.1 | 256.2 MB | ########6 | 87%  2025-05-07T20:25:35.1511545Z nsight-compute-2024. | 443.1 MB | ####6 | 47% 2025-05-07T20:25:35.1511839Z 2025-05-07T20:25:35.1511843Z 2025-05-07T20:25:35.1511847Z 2025-05-07T20:25:35.1511851Z 2025-05-07T20:25:35.1511855Z 2025-05-07T20:25:35.1511858Z 2025-05-07T20:25:35.1811888Z libcusolver-11.7.1.2 | 95.8 MB | #1 | 11%  2025-05-07T20:25:35.1812182Z 2025-05-07T20:25:35.1812185Z 2025-05-07T20:25:35.1812189Z 2025-05-07T20:25:35.1812193Z 2025-05-07T20:25:35.1812197Z 2025-05-07T20:25:35.2274515Z cuda-nvvp-12.6.80 | 109.3 MB | #3 | 13%  2025-05-07T20:25:35.2276248Z 2025-05-07T20:25:35.2462351Z libcublas-12.6.4.1 | 256.2 MB | ########8 | 88%  2025-05-07T20:25:35.2514032Z nsight-compute-2024. | 443.1 MB | ####7 | 48% 2025-05-07T20:25:35.2514359Z 2025-05-07T20:25:35.2514363Z 2025-05-07T20:25:35.2514388Z 2025-05-07T20:25:35.2514393Z 2025-05-07T20:25:35.2514397Z 2025-05-07T20:25:35.2516875Z 2025-05-07T20:25:35.2814481Z libcusolver-11.7.1.2 | 95.8 MB | #3 | 14%  2025-05-07T20:25:35.2814848Z 2025-05-07T20:25:35.2814852Z 2025-05-07T20:25:35.2814856Z 2025-05-07T20:25:35.2814860Z 2025-05-07T20:25:35.2814864Z 2025-05-07T20:25:35.3434256Z cuda-nvvp-12.6.80 | 109.3 MB | #5 | 16%  2025-05-07T20:25:35.3434549Z 2025-05-07T20:25:35.3523835Z libcublas-12.6.4.1 | 256.2 MB | ########9 | 89%  2025-05-07T20:25:35.3524477Z 2025-05-07T20:25:35.3524485Z 2025-05-07T20:25:35.3524492Z 2025-05-07T20:25:35.3524499Z 2025-05-07T20:25:35.3524505Z 2025-05-07T20:25:35.3524513Z 2025-05-07T20:25:35.3542465Z libcusolver-11.7.1.2 | 95.8 MB | #6 | 17%  2025-05-07T20:25:35.3815437Z nsight-compute-2024. 
| 443.1 MB | ####8 | 48% 2025-05-07T20:25:35.3815700Z 2025-05-07T20:25:35.3815704Z 2025-05-07T20:25:35.3815732Z 2025-05-07T20:25:35.3815744Z 2025-05-07T20:25:35.3815748Z 2025-05-07T20:25:35.4526325Z cuda-nvvp-12.6.80 | 109.3 MB | #8 | 19%  2025-05-07T20:25:35.4526814Z 2025-05-07T20:25:35.4526844Z 2025-05-07T20:25:35.4526847Z 2025-05-07T20:25:35.4526851Z 2025-05-07T20:25:35.4526855Z 2025-05-07T20:25:35.4526863Z 2025-05-07T20:25:35.4580594Z libcusolver-11.7.1.2 | 95.8 MB | #9 | 20%  2025-05-07T20:25:35.4585978Z 2025-05-07T20:25:35.4610979Z libcublas-12.6.4.1 | 256.2 MB | ######### | 91%  2025-05-07T20:25:35.4822911Z nsight-compute-2024. | 443.1 MB | ####9 | 49% 2025-05-07T20:25:35.4823277Z 2025-05-07T20:25:35.4823513Z 2025-05-07T20:25:35.4823520Z 2025-05-07T20:25:35.4823523Z 2025-05-07T20:25:35.4823546Z 2025-05-07T20:25:35.5532133Z cuda-nvvp-12.6.80 | 109.3 MB | ##1 | 21%  2025-05-07T20:25:35.5532439Z 2025-05-07T20:25:35.5532443Z 2025-05-07T20:25:35.5532447Z 2025-05-07T20:25:35.5532451Z 2025-05-07T20:25:35.5532763Z 2025-05-07T20:25:35.5532767Z 2025-05-07T20:25:35.5615062Z libcusolver-11.7.1.2 | 95.8 MB | ##3 | 23%  2025-05-07T20:25:35.5616156Z 2025-05-07T20:25:35.5618814Z libcublas-12.6.4.1 | 256.2 MB | #########1 | 92%  2025-05-07T20:25:35.5829341Z nsight-compute-2024. | 443.1 MB | ####9 | 50% 2025-05-07T20:25:35.5829650Z 2025-05-07T20:25:35.5829654Z 2025-05-07T20:25:35.5829658Z 2025-05-07T20:25:35.5829662Z 2025-05-07T20:25:35.5829665Z 2025-05-07T20:25:35.6569136Z cuda-nvvp-12.6.80 | 109.3 MB | ##3 | 24%  2025-05-07T20:25:35.6569539Z 2025-05-07T20:25:35.6569542Z 2025-05-07T20:25:35.6569546Z 2025-05-07T20:25:35.6569550Z 2025-05-07T20:25:35.6569554Z 2025-05-07T20:25:35.6571769Z 2025-05-07T20:25:35.6623690Z libcusolver-11.7.1.2 | 95.8 MB | ##6 | 26%  2025-05-07T20:25:35.6701082Z nsight-compute-2024. | 443.1 MB | ##### | 51% 2025-05-07T20:25:35.6703260Z 2025-05-07T20:25:35.7624355Z libcublas-12.6.4.1 | 256.2 MB | #########3 | 93%  2025-05-07T20:25:35.7702711Z nsight-compute-2024. | 443.1 MB | #####1 | 51% 2025-05-07T20:25:35.7705091Z 2025-05-07T20:25:35.7716917Z libcublas-12.6.4.1 | 256.2 MB | #########4 | 95%  2025-05-07T20:25:35.7717181Z 2025-05-07T20:25:35.7717186Z 2025-05-07T20:25:35.7717189Z 2025-05-07T20:25:35.7717193Z 2025-05-07T20:25:35.7717197Z 2025-05-07T20:25:35.7718910Z 2025-05-07T20:25:35.8436519Z libcusolver-11.7.1.2 | 95.8 MB | ##9 | 29%  2025-05-07T20:25:35.8436825Z 2025-05-07T20:25:35.8436830Z 2025-05-07T20:25:35.8436833Z 2025-05-07T20:25:35.8436837Z 2025-05-07T20:25:35.8437991Z 2025-05-07T20:25:35.8630872Z cuda-nvvp-12.6.80 | 109.3 MB | ##6 | 26%  2025-05-07T20:25:35.8709922Z nsight-compute-2024. | 443.1 MB | #####2 | 52% 2025-05-07T20:25:35.8710455Z 2025-05-07T20:25:35.8800341Z libcublas-12.6.4.1 | 256.2 MB | #########5 | 96%  2025-05-07T20:25:35.8800687Z 2025-05-07T20:25:35.8800720Z 2025-05-07T20:25:35.8800725Z 2025-05-07T20:25:35.8800731Z 2025-05-07T20:25:35.8800736Z 2025-05-07T20:25:35.8804832Z 2025-05-07T20:25:35.9444760Z libcusolver-11.7.1.2 | 95.8 MB | ###2 | 32%  2025-05-07T20:25:35.9445063Z 2025-05-07T20:25:35.9445071Z 2025-05-07T20:25:35.9445075Z 2025-05-07T20:25:35.9445079Z 2025-05-07T20:25:35.9448566Z 2025-05-07T20:25:35.9784586Z cuda-nvvp-12.6.80 | 109.3 MB | ##9 | 29%  2025-05-07T20:25:35.9803787Z nsight-compute-2024. 
| 443.1 MB | #####3 | 53% 2025-05-07T20:25:35.9804162Z 2025-05-07T20:25:35.9804169Z 2025-05-07T20:25:35.9804176Z 2025-05-07T20:25:35.9804182Z 2025-05-07T20:25:35.9804187Z 2025-05-07T20:25:35.9805801Z 2025-05-07T20:25:35.9922672Z libcusolver-11.7.1.2 | 95.8 MB | ###4 | 35%  2025-05-07T20:25:35.9924144Z 2025-05-07T20:25:36.0446004Z libcublas-12.6.4.1 | 256.2 MB | #########7 | 97%  2025-05-07T20:25:36.0446277Z 2025-05-07T20:25:36.0446305Z 2025-05-07T20:25:36.0446309Z 2025-05-07T20:25:36.0446313Z 2025-05-07T20:25:36.0447850Z 2025-05-07T20:25:36.0805136Z cuda-nvvp-12.6.80 | 109.3 MB | ###1 | 32%  2025-05-07T20:25:36.0805567Z 2025-05-07T20:25:36.0805592Z 2025-05-07T20:25:36.0805597Z 2025-05-07T20:25:36.0805795Z 2025-05-07T20:25:36.0805801Z 2025-05-07T20:25:36.0806752Z 2025-05-07T20:25:36.0854592Z libcusolver-11.7.1.2 | 95.8 MB | ###8 | 38%  2025-05-07T20:25:36.0973030Z nsight-compute-2024. | 443.1 MB | #####3 | 54% 2025-05-07T20:25:36.0975920Z 2025-05-07T20:25:36.1447274Z libcublas-12.6.4.1 | 256.2 MB | #########8 | 98%  2025-05-07T20:25:36.1447647Z 2025-05-07T20:25:36.1447657Z 2025-05-07T20:25:36.1447661Z 2025-05-07T20:25:36.1447665Z 2025-05-07T20:25:36.1447668Z 2025-05-07T20:25:36.1807022Z cuda-nvvp-12.6.80 | 109.3 MB | ###4 | 35%  2025-05-07T20:25:36.1807318Z 2025-05-07T20:25:36.1807322Z 2025-05-07T20:25:36.1807325Z 2025-05-07T20:25:36.1807677Z 2025-05-07T20:25:36.1807681Z 2025-05-07T20:25:36.1808406Z 2025-05-07T20:25:36.1875876Z libcusolver-11.7.1.2 | 95.8 MB | ####1 | 41%  2025-05-07T20:25:36.1984431Z nsight-compute-2024. | 443.1 MB | #####4 | 55% 2025-05-07T20:25:36.1991717Z 2025-05-07T20:25:36.2454254Z libcublas-12.6.4.1 | 256.2 MB | #########9 | 100%  2025-05-07T20:25:36.2454602Z 2025-05-07T20:25:36.2454607Z 2025-05-07T20:25:36.2454612Z 2025-05-07T20:25:36.2454616Z 2025-05-07T20:25:36.2454631Z 2025-05-07T20:25:36.2809444Z cuda-nvvp-12.6.80 | 109.3 MB | ###7 | 38%  2025-05-07T20:25:36.2809730Z 2025-05-07T20:25:36.2809735Z 2025-05-07T20:25:36.2809738Z 2025-05-07T20:25:36.2809742Z 2025-05-07T20:25:36.2809754Z 2025-05-07T20:25:36.2809757Z 2025-05-07T20:25:36.2878313Z libcusolver-11.7.1.2 | 95.8 MB | ####4 | 45%  2025-05-07T20:25:36.3457416Z nsight-compute-2024. | 443.1 MB | #####5 | 55% 2025-05-07T20:25:36.3457682Z 2025-05-07T20:25:36.3457898Z 2025-05-07T20:25:36.3457903Z 2025-05-07T20:25:36.3457908Z 2025-05-07T20:25:36.3460810Z 2025-05-07T20:25:36.3813102Z cuda-nvvp-12.6.80 | 109.3 MB | #### | 41%  2025-05-07T20:25:36.3813452Z 2025-05-07T20:25:36.3813456Z 2025-05-07T20:25:36.3813460Z 2025-05-07T20:25:36.3813464Z 2025-05-07T20:25:36.3813467Z 2025-05-07T20:25:36.3813471Z 2025-05-07T20:25:36.4009933Z libcusolver-11.7.1.2 | 95.8 MB | ####8 | 48%  2025-05-07T20:25:36.4502491Z nsight-compute-2024. | 443.1 MB | #####6 | 56% 2025-05-07T20:25:36.4502761Z 2025-05-07T20:25:36.4502765Z 2025-05-07T20:25:36.4502769Z 2025-05-07T20:25:36.4502772Z 2025-05-07T20:25:36.4505139Z 2025-05-07T20:25:36.4815513Z cuda-nvvp-12.6.80 | 109.3 MB | ####3 | 44%  2025-05-07T20:25:36.4815895Z 2025-05-07T20:25:36.4815906Z 2025-05-07T20:25:36.4815911Z 2025-05-07T20:25:36.4815915Z 2025-05-07T20:25:36.4815920Z 2025-05-07T20:25:36.4815924Z 2025-05-07T20:25:36.5013323Z libcusolver-11.7.1.2 | 95.8 MB | #####1 | 51%  2025-05-07T20:25:36.5505580Z nsight-compute-2024. 
| 443.1 MB | #####6 | 57% 2025-05-07T20:25:36.5514816Z 2025-05-07T20:25:36.5514827Z 2025-05-07T20:25:36.5514856Z 2025-05-07T20:25:36.5514866Z 2025-05-07T20:25:36.5514874Z 2025-05-07T20:25:36.5867581Z cuda-nvvp-12.6.80 | 109.3 MB | ####6 | 47%  2025-05-07T20:25:36.5867933Z 2025-05-07T20:25:36.5867937Z 2025-05-07T20:25:36.5867941Z 2025-05-07T20:25:36.5867944Z 2025-05-07T20:25:36.5867948Z 2025-05-07T20:25:36.5871615Z 2025-05-07T20:25:36.6155728Z libcusolver-11.7.1.2 | 95.8 MB | #####4 | 55%  2025-05-07T20:25:36.6660129Z nsight-compute-2024. | 443.1 MB | #####7 | 58% 2025-05-07T20:25:36.6660416Z 2025-05-07T20:25:36.6660420Z 2025-05-07T20:25:36.6660424Z 2025-05-07T20:25:36.6660428Z 2025-05-07T20:25:36.6667479Z 2025-05-07T20:25:36.6868533Z cuda-nvvp-12.6.80 | 109.3 MB | ####9 | 50%  2025-05-07T20:25:36.6868831Z 2025-05-07T20:25:36.6868835Z 2025-05-07T20:25:36.6868839Z 2025-05-07T20:25:36.6868843Z 2025-05-07T20:25:36.6868846Z 2025-05-07T20:25:36.6868850Z 2025-05-07T20:25:36.7272926Z libcusolver-11.7.1.2 | 95.8 MB | #####7 | 58%  2025-05-07T20:25:36.7768310Z nsight-compute-2024. | 443.1 MB | #####8 | 58% 2025-05-07T20:25:36.7768602Z 2025-05-07T20:25:36.7768606Z 2025-05-07T20:25:36.7768610Z 2025-05-07T20:25:36.7768613Z 2025-05-07T20:25:36.7771159Z 2025-05-07T20:25:36.7900276Z cuda-nvvp-12.6.80 | 109.3 MB | #####2 | 52%  2025-05-07T20:25:36.7900665Z 2025-05-07T20:25:36.7900669Z 2025-05-07T20:25:36.7900673Z 2025-05-07T20:25:36.7900686Z 2025-05-07T20:25:36.7900690Z 2025-05-07T20:25:36.7900693Z 2025-05-07T20:25:36.8322604Z libcusolver-11.7.1.2 | 95.8 MB | ######1 | 61%  2025-05-07T20:25:36.8770970Z nsight-compute-2024. | 443.1 MB | #####9 | 59% 2025-05-07T20:25:36.8771352Z 2025-05-07T20:25:36.8771657Z 2025-05-07T20:25:36.8771662Z 2025-05-07T20:25:36.8771668Z 2025-05-07T20:25:36.8773142Z 2025-05-07T20:25:36.8972806Z cuda-nvvp-12.6.80 | 109.3 MB | #####5 | 55%  2025-05-07T20:25:36.8973375Z 2025-05-07T20:25:36.8973383Z 2025-05-07T20:25:36.8973388Z 2025-05-07T20:25:36.8973393Z 2025-05-07T20:25:36.8973398Z 2025-05-07T20:25:36.8975680Z 2025-05-07T20:25:36.9422925Z libcusolver-11.7.1.2 | 95.8 MB | ######4 | 64%  2025-05-07T20:25:36.9771430Z nsight-compute-2024. | 443.1 MB | #####9 | 60% 2025-05-07T20:25:36.9771696Z 2025-05-07T20:25:36.9771700Z 2025-05-07T20:25:36.9771704Z 2025-05-07T20:25:36.9771708Z 2025-05-07T20:25:36.9776076Z 2025-05-07T20:25:36.9974750Z cuda-nvvp-12.6.80 | 109.3 MB | #####8 | 58%  2025-05-07T20:25:36.9975042Z 2025-05-07T20:25:36.9975047Z 2025-05-07T20:25:36.9975050Z 2025-05-07T20:25:36.9975063Z 2025-05-07T20:25:36.9975066Z 2025-05-07T20:25:36.9976905Z 2025-05-07T20:25:37.0429443Z libcusolver-11.7.1.2 | 95.8 MB | ######7 | 68%  2025-05-07T20:25:37.0773863Z nsight-compute-2024. | 443.1 MB | ###### | 60% 2025-05-07T20:25:37.0774129Z 2025-05-07T20:25:37.0774133Z 2025-05-07T20:25:37.0774149Z 2025-05-07T20:25:37.0774153Z 2025-05-07T20:25:37.0774157Z 2025-05-07T20:25:37.0985298Z cuda-nvvp-12.6.80 | 109.3 MB | ###### | 61%  2025-05-07T20:25:37.0985582Z 2025-05-07T20:25:37.0985586Z 2025-05-07T20:25:37.0985590Z 2025-05-07T20:25:37.0985594Z 2025-05-07T20:25:37.0985597Z 2025-05-07T20:25:37.0987157Z 2025-05-07T20:25:37.1439638Z libcusolver-11.7.1.2 | 95.8 MB | ####### | 71%  2025-05-07T20:25:37.1774120Z nsight-compute-2024. 
| 443.1 MB | ######1 | 61% 2025-05-07T20:25:37.1774377Z 2025-05-07T20:25:37.1774515Z 2025-05-07T20:25:37.1774520Z 2025-05-07T20:25:37.1774524Z 2025-05-07T20:25:37.1775883Z 2025-05-07T20:25:37.1986279Z cuda-nvvp-12.6.80 | 109.3 MB | ######3 | 64%  2025-05-07T20:25:37.1986575Z 2025-05-07T20:25:37.1986583Z 2025-05-07T20:25:37.1986587Z 2025-05-07T20:25:37.1986591Z 2025-05-07T20:25:37.1986594Z 2025-05-07T20:25:37.1990440Z 2025-05-07T20:25:37.2775418Z libcusolver-11.7.1.2 | 95.8 MB | #######4 | 75%  2025-05-07T20:25:37.2775714Z 2025-05-07T20:25:37.2775718Z 2025-05-07T20:25:37.2775722Z 2025-05-07T20:25:37.2775725Z 2025-05-07T20:25:37.2779742Z 2025-05-07T20:25:37.2992875Z cuda-nvvp-12.6.80 | 109.3 MB | ######7 | 67%  2025-05-07T20:25:37.2993153Z 2025-05-07T20:25:37.2993157Z 2025-05-07T20:25:37.2993160Z 2025-05-07T20:25:37.2993164Z 2025-05-07T20:25:37.2993168Z 2025-05-07T20:25:37.2995720Z 2025-05-07T20:25:37.3425171Z libcusolver-11.7.1.2 | 95.8 MB | #######8 | 79%  2025-05-07T20:25:37.3782019Z nsight-compute-2024. | 443.1 MB | ######1 | 62% 2025-05-07T20:25:37.3782281Z 2025-05-07T20:25:37.3782286Z 2025-05-07T20:25:37.3782289Z 2025-05-07T20:25:37.3782293Z 2025-05-07T20:25:37.3784824Z 2025-05-07T20:25:37.3993330Z cuda-nvvp-12.6.80 | 109.3 MB | ####### | 70%  2025-05-07T20:25:37.3993618Z 2025-05-07T20:25:37.3993622Z 2025-05-07T20:25:37.3993625Z 2025-05-07T20:25:37.3993629Z 2025-05-07T20:25:37.3993644Z 2025-05-07T20:25:37.3993648Z 2025-05-07T20:25:37.4429644Z libcusolver-11.7.1.2 | 95.8 MB | ########2 | 83%  2025-05-07T20:25:37.4937332Z nsight-compute-2024. | 443.1 MB | ######2 | 63% 2025-05-07T20:25:37.4937599Z 2025-05-07T20:25:37.4937603Z 2025-05-07T20:25:37.4937607Z 2025-05-07T20:25:37.4937611Z 2025-05-07T20:25:37.4937614Z 2025-05-07T20:25:37.5219535Z cuda-nvvp-12.6.80 | 109.3 MB | #######3 | 73%  2025-05-07T20:25:37.5219871Z 2025-05-07T20:25:37.5219875Z 2025-05-07T20:25:37.5219879Z 2025-05-07T20:25:37.5219882Z 2025-05-07T20:25:37.5219886Z 2025-05-07T20:25:37.5219890Z 2025-05-07T20:25:37.5433545Z libcusolver-11.7.1.2 | 95.8 MB | ########6 | 86%  2025-05-07T20:25:37.5944655Z nsight-compute-2024. | 443.1 MB | ######3 | 63% 2025-05-07T20:25:37.5944910Z 2025-05-07T20:25:37.5944914Z 2025-05-07T20:25:37.5944918Z 2025-05-07T20:25:37.5944921Z 2025-05-07T20:25:37.5948849Z 2025-05-07T20:25:37.6294299Z cuda-nvvp-12.6.80 | 109.3 MB | #######6 | 76%  2025-05-07T20:25:37.6294678Z 2025-05-07T20:25:37.6294683Z 2025-05-07T20:25:37.6294687Z 2025-05-07T20:25:37.6294704Z 2025-05-07T20:25:37.6294708Z 2025-05-07T20:25:37.6299322Z 2025-05-07T20:25:37.6436819Z libcusolver-11.7.1.2 | 95.8 MB | ########9 | 90%  2025-05-07T20:25:37.7084163Z nsight-compute-2024. | 443.1 MB | ######4 | 64% 2025-05-07T20:25:37.7084438Z 2025-05-07T20:25:37.7084442Z 2025-05-07T20:25:37.7084446Z 2025-05-07T20:25:37.7084450Z 2025-05-07T20:25:37.7087338Z 2025-05-07T20:25:37.7295423Z cuda-nvvp-12.6.80 | 109.3 MB | #######9 | 79%  2025-05-07T20:25:37.7295705Z 2025-05-07T20:25:37.7295709Z 2025-05-07T20:25:37.7295712Z 2025-05-07T20:25:37.7295737Z 2025-05-07T20:25:37.7295748Z 2025-05-07T20:25:37.7299979Z 2025-05-07T20:25:37.7441763Z libcusolver-11.7.1.2 | 95.8 MB | #########3 | 93%  2025-05-07T20:25:37.8175018Z nsight-compute-2024. 
| 443.1 MB | ######4 | 65% 2025-05-07T20:25:37.8175279Z 2025-05-07T20:25:37.8175283Z 2025-05-07T20:25:37.8175286Z 2025-05-07T20:25:37.8175290Z 2025-05-07T20:25:37.8176580Z 2025-05-07T20:25:37.8307741Z cuda-nvvp-12.6.80 | 109.3 MB | ########2 | 82%  2025-05-07T20:25:37.8308067Z 2025-05-07T20:25:37.8308299Z 2025-05-07T20:25:37.8308307Z 2025-05-07T20:25:37.8308312Z 2025-05-07T20:25:37.8308317Z 2025-05-07T20:25:37.8308322Z 2025-05-07T20:25:37.8504806Z libcusolver-11.7.1.2 | 95.8 MB | #########6 | 97%  2025-05-07T20:25:37.9180185Z nsight-compute-2024. | 443.1 MB | ######5 | 66% 2025-05-07T20:25:37.9180449Z 2025-05-07T20:25:37.9180454Z 2025-05-07T20:25:37.9180458Z 2025-05-07T20:25:37.9180461Z 2025-05-07T20:25:37.9182070Z 2025-05-07T20:25:37.9506910Z cuda-nvvp-12.6.80 | 109.3 MB | ########4 | 85%  2025-05-07T20:25:38.0185965Z nsight-compute-2024. | 443.1 MB | ######6 | 66% 2025-05-07T20:25:38.0186227Z 2025-05-07T20:25:38.0186258Z 2025-05-07T20:25:38.0186264Z 2025-05-07T20:25:38.0186269Z 2025-05-07T20:25:38.0188156Z 2025-05-07T20:25:38.0511284Z cuda-nvvp-12.6.80 | 109.3 MB | ########7 | 88%  2025-05-07T20:25:38.0815126Z nsight-compute-2024. | 443.1 MB | ######7 | 67% 2025-05-07T20:25:38.0815384Z 2025-05-07T20:25:38.0815388Z 2025-05-07T20:25:38.0815392Z 2025-05-07T20:25:38.0815457Z 2025-05-07T20:25:38.1190454Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:38.1190737Z 2025-05-07T20:25:38.1190742Z 2025-05-07T20:25:38.1190748Z 2025-05-07T20:25:38.1190752Z 2025-05-07T20:25:38.1190922Z 2025-05-07T20:25:38.1512557Z cuda-nvvp-12.6.80 | 109.3 MB | #########1 | 91%  2025-05-07T20:25:38.2195625Z nsight-compute-2024. | 443.1 MB | ######8 | 68% 2025-05-07T20:25:38.2195948Z 2025-05-07T20:25:38.2195952Z 2025-05-07T20:25:38.2195957Z 2025-05-07T20:25:38.2195961Z 2025-05-07T20:25:38.2196091Z 2025-05-07T20:25:38.2512807Z cuda-nvvp-12.6.80 | 109.3 MB | #########5 | 95%  2025-05-07T20:25:38.3196369Z nsight-compute-2024. | 443.1 MB | ######9 | 69% 2025-05-07T20:25:38.3196625Z 2025-05-07T20:25:38.3196629Z 2025-05-07T20:25:38.3196633Z 2025-05-07T20:25:38.3196637Z 2025-05-07T20:25:38.3198241Z 2025-05-07T20:25:38.3516052Z cuda-nvvp-12.6.80 | 109.3 MB | #########8 | 99%  2025-05-07T20:25:38.4518126Z nsight-compute-2024. | 443.1 MB | ####### | 70% 2025-05-07T20:25:38.5518269Z nsight-compute-2024. | 443.1 MB | #######1 | 71% 2025-05-07T20:25:38.6519879Z nsight-compute-2024. | 443.1 MB | #######2 | 72% 2025-05-07T20:25:38.7521267Z nsight-compute-2024. | 443.1 MB | #######3 | 73% 2025-05-07T20:25:38.8516107Z nsight-compute-2024. | 443.1 MB | #######4 | 74% 2025-05-07T20:25:38.8516835Z 2025-05-07T20:25:38.8517663Z 2025-05-07T20:25:38.8596623Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%  2025-05-07T20:25:38.9105065Z nsight-compute-2024. | 443.1 MB | #######5 | 75% 2025-05-07T20:25:38.9105425Z 2025-05-07T20:25:38.9105431Z 2025-05-07T20:25:38.9105436Z 2025-05-07T20:25:38.9105441Z 2025-05-07T20:25:38.9105446Z 2025-05-07T20:25:38.9105451Z 2025-05-07T20:25:38.9118779Z 2025-05-07T20:25:38.9735561Z libnpp-12.3.1.54 | 93.4 MB | | 0%  2025-05-07T20:25:39.0109158Z nsight-compute-2024. | 443.1 MB | #######6 | 76% 2025-05-07T20:25:39.0109471Z 2025-05-07T20:25:39.0109475Z 2025-05-07T20:25:39.0109479Z 2025-05-07T20:25:39.0109483Z 2025-05-07T20:25:39.0109487Z 2025-05-07T20:25:39.0109492Z 2025-05-07T20:25:39.0109495Z 2025-05-07T20:25:39.0840635Z libnpp-12.3.1.54 | 93.4 MB | 3 | 4%  2025-05-07T20:25:39.1111907Z nsight-compute-2024. 
| 443.1 MB | #######7 | 77% 2025-05-07T20:25:39.1112214Z 2025-05-07T20:25:39.1112218Z 2025-05-07T20:25:39.1112222Z 2025-05-07T20:25:39.1112227Z 2025-05-07T20:25:39.1112232Z 2025-05-07T20:25:39.1112235Z 2025-05-07T20:25:39.1112239Z 2025-05-07T20:25:39.2008951Z libnpp-12.3.1.54 | 93.4 MB | 7 | 8%  2025-05-07T20:25:39.2114548Z nsight-compute-2024. | 443.1 MB | #######8 | 78% 2025-05-07T20:25:39.2114811Z 2025-05-07T20:25:39.2114815Z 2025-05-07T20:25:39.2114828Z 2025-05-07T20:25:39.2114832Z 2025-05-07T20:25:39.2114836Z 2025-05-07T20:25:39.2114840Z 2025-05-07T20:25:39.2118454Z 2025-05-07T20:25:39.3169933Z libnpp-12.3.1.54 | 93.4 MB | # | 11%  2025-05-07T20:25:39.3363881Z nsight-compute-2024. | 443.1 MB | #######9 | 79% 2025-05-07T20:25:39.3364254Z 2025-05-07T20:25:39.3364261Z 2025-05-07T20:25:39.3364266Z 2025-05-07T20:25:39.3364273Z 2025-05-07T20:25:39.3364278Z 2025-05-07T20:25:39.3364295Z 2025-05-07T20:25:39.3368660Z 2025-05-07T20:25:39.4177309Z libnpp-12.3.1.54 | 93.4 MB | #4 | 14%  2025-05-07T20:25:39.4366698Z nsight-compute-2024. | 443.1 MB | #######9 | 80% 2025-05-07T20:25:39.4367083Z 2025-05-07T20:25:39.4367108Z 2025-05-07T20:25:39.4367114Z 2025-05-07T20:25:39.4367119Z 2025-05-07T20:25:39.4367124Z 2025-05-07T20:25:39.4367129Z 2025-05-07T20:25:39.4369952Z 2025-05-07T20:25:39.5236273Z libnpp-12.3.1.54 | 93.4 MB | #7 | 18%  2025-05-07T20:25:39.5822728Z nsight-compute-2024. | 443.1 MB | ######## | 81% 2025-05-07T20:25:39.5823101Z 2025-05-07T20:25:39.5823107Z 2025-05-07T20:25:39.5823113Z 2025-05-07T20:25:39.5823118Z 2025-05-07T20:25:39.5823123Z 2025-05-07T20:25:39.5823129Z 2025-05-07T20:25:39.5823135Z 2025-05-07T20:25:39.6242302Z libnpp-12.3.1.54 | 93.4 MB | ## | 21%  2025-05-07T20:25:39.7120823Z nsight-compute-2024. | 443.1 MB | ########1 | 82% 2025-05-07T20:25:39.7121195Z 2025-05-07T20:25:39.7121233Z 2025-05-07T20:25:39.7121238Z 2025-05-07T20:25:39.7121244Z 2025-05-07T20:25:39.7121249Z 2025-05-07T20:25:39.7121254Z 2025-05-07T20:25:39.7121259Z 2025-05-07T20:25:39.7342456Z libnpp-12.3.1.54 | 93.4 MB | ##3 | 24%  2025-05-07T20:25:39.8122614Z nsight-compute-2024. | 443.1 MB | ########2 | 82% 2025-05-07T20:25:39.8122992Z 2025-05-07T20:25:39.8122998Z 2025-05-07T20:25:39.8123003Z 2025-05-07T20:25:39.8123008Z 2025-05-07T20:25:39.8123013Z 2025-05-07T20:25:39.8123018Z 2025-05-07T20:25:39.8123023Z 2025-05-07T20:25:39.8698420Z libnpp-12.3.1.54 | 93.4 MB | ##6 | 27%  2025-05-07T20:25:39.9123808Z nsight-compute-2024. | 443.1 MB | ########3 | 83% 2025-05-07T20:25:39.9124176Z 2025-05-07T20:25:39.9124182Z 2025-05-07T20:25:39.9124187Z 2025-05-07T20:25:39.9124192Z 2025-05-07T20:25:39.9124196Z 2025-05-07T20:25:39.9124201Z 2025-05-07T20:25:39.9124206Z 2025-05-07T20:25:39.9713538Z libnpp-12.3.1.54 | 93.4 MB | ### | 30%  2025-05-07T20:25:40.0125597Z nsight-compute-2024. | 443.1 MB | ########4 | 84% 2025-05-07T20:25:40.0126350Z 2025-05-07T20:25:40.0126369Z 2025-05-07T20:25:40.0126375Z 2025-05-07T20:25:40.0126381Z 2025-05-07T20:25:40.0126657Z 2025-05-07T20:25:40.0126662Z 2025-05-07T20:25:40.0126907Z 2025-05-07T20:25:40.0782758Z libnpp-12.3.1.54 | 93.4 MB | ###3 | 33%  2025-05-07T20:25:40.1131011Z nsight-compute-2024. | 443.1 MB | ########4 | 85% 2025-05-07T20:25:40.1131374Z 2025-05-07T20:25:40.1131379Z 2025-05-07T20:25:40.1131385Z 2025-05-07T20:25:40.1131402Z 2025-05-07T20:25:40.1131407Z 2025-05-07T20:25:40.1131413Z 2025-05-07T20:25:40.1133112Z 2025-05-07T20:25:40.1787383Z libnpp-12.3.1.54 | 93.4 MB | ###6 | 37%  2025-05-07T20:25:40.2139047Z nsight-compute-2024. 
| 443.1 MB | ########5 | 86% 2025-05-07T20:25:40.2139413Z 2025-05-07T20:25:40.2139419Z 2025-05-07T20:25:40.2139424Z 2025-05-07T20:25:40.2139430Z 2025-05-07T20:25:40.2139462Z 2025-05-07T20:25:40.2139467Z 2025-05-07T20:25:40.2139481Z 2025-05-07T20:25:40.2820297Z libnpp-12.3.1.54 | 93.4 MB | #### | 41%  2025-05-07T20:25:40.3148235Z nsight-compute-2024. | 443.1 MB | ########6 | 86% 2025-05-07T20:25:40.3148597Z 2025-05-07T20:25:40.3148604Z 2025-05-07T20:25:40.3148609Z 2025-05-07T20:25:40.3148614Z 2025-05-07T20:25:40.3148619Z 2025-05-07T20:25:40.3148624Z 2025-05-07T20:25:40.3152730Z 2025-05-07T20:25:40.3912756Z libnpp-12.3.1.54 | 93.4 MB | ####3 | 44%  2025-05-07T20:25:40.4224653Z nsight-compute-2024. | 443.1 MB | ########7 | 87% 2025-05-07T20:25:40.4225020Z 2025-05-07T20:25:40.4225027Z 2025-05-07T20:25:40.4225042Z 2025-05-07T20:25:40.4225047Z 2025-05-07T20:25:40.4225052Z 2025-05-07T20:25:40.4225057Z 2025-05-07T20:25:40.4227642Z 2025-05-07T20:25:40.4917861Z libnpp-12.3.1.54 | 93.4 MB | ####7 | 47%  2025-05-07T20:25:40.5226499Z nsight-compute-2024. | 443.1 MB | ########7 | 88% 2025-05-07T20:25:40.5226902Z 2025-05-07T20:25:40.5226908Z 2025-05-07T20:25:40.5226913Z 2025-05-07T20:25:40.5226918Z 2025-05-07T20:25:40.5226923Z 2025-05-07T20:25:40.5226937Z 2025-05-07T20:25:40.5230162Z 2025-05-07T20:25:40.5950921Z libnpp-12.3.1.54 | 93.4 MB | ##### | 51%  2025-05-07T20:25:40.6119379Z nsight-compute-2024. | 443.1 MB | ########8 | 89% 2025-05-07T20:25:40.6119646Z 2025-05-07T20:25:40.6119857Z 2025-05-07T20:25:40.6119862Z 2025-05-07T20:25:40.6119875Z 2025-05-07T20:25:40.6119891Z 2025-05-07T20:25:40.6133109Z 2025-05-07T20:25:40.6227706Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%  2025-05-07T20:25:40.6228159Z 2025-05-07T20:25:40.6228163Z 2025-05-07T20:25:40.6228168Z 2025-05-07T20:25:40.6228172Z 2025-05-07T20:25:40.6228176Z 2025-05-07T20:25:40.6228181Z 2025-05-07T20:25:40.6228185Z 2025-05-07T20:25:40.6570375Z libnpp-12.3.1.54 | 93.4 MB | #####4 | 54%  2025-05-07T20:25:40.6570717Z 2025-05-07T20:25:40.6570721Z 2025-05-07T20:25:40.6570724Z 2025-05-07T20:25:40.6570728Z 2025-05-07T20:25:40.6570732Z 2025-05-07T20:25:40.6570736Z 2025-05-07T20:25:40.6570739Z 2025-05-07T20:25:40.6572614Z 2025-05-07T20:25:40.6968029Z cuda-nvdisasm-12.6.7 | 47.6 MB | | 0%  2025-05-07T20:25:40.7403839Z nsight-compute-2024. | 443.1 MB | ########9 | 89% 2025-05-07T20:25:40.7404150Z 2025-05-07T20:25:40.7404156Z 2025-05-07T20:25:40.7404161Z 2025-05-07T20:25:40.7404166Z 2025-05-07T20:25:40.7404172Z 2025-05-07T20:25:40.7404177Z 2025-05-07T20:25:40.7406239Z 2025-05-07T20:25:40.7571750Z libnpp-12.3.1.54 | 93.4 MB | #####7 | 57%  2025-05-07T20:25:40.7572187Z 2025-05-07T20:25:40.7572194Z 2025-05-07T20:25:40.7572199Z 2025-05-07T20:25:40.7572205Z 2025-05-07T20:25:40.7572210Z 2025-05-07T20:25:40.7572216Z 2025-05-07T20:25:40.7572221Z 2025-05-07T20:25:40.7577762Z 2025-05-07T20:25:40.8099744Z cuda-nvdisasm-12.6.7 | 47.6 MB | 6 | 6%  2025-05-07T20:25:40.8574612Z nsight-compute-2024. | 443.1 MB | ######### | 90% 2025-05-07T20:25:40.8574895Z 2025-05-07T20:25:40.8574899Z 2025-05-07T20:25:40.8575146Z 2025-05-07T20:25:40.8575151Z 2025-05-07T20:25:40.8575155Z 2025-05-07T20:25:40.8575158Z 2025-05-07T20:25:40.8575162Z 2025-05-07T20:25:40.8575811Z 2025-05-07T20:25:40.9099489Z cuda-nvdisasm-12.6.7 | 47.6 MB | #2 | 13%  2025-05-07T20:25:40.9470092Z nsight-compute-2024. 
| 443.1 MB | ######### | 91% 2025-05-07T20:25:40.9470429Z 2025-05-07T20:25:40.9470644Z 2025-05-07T20:25:40.9470657Z 2025-05-07T20:25:40.9470661Z 2025-05-07T20:25:40.9470664Z 2025-05-07T20:25:40.9470668Z 2025-05-07T20:25:40.9472555Z 2025-05-07T20:25:40.9578550Z libnpp-12.3.1.54 | 93.4 MB | ###### | 61%  2025-05-07T20:25:40.9578844Z 2025-05-07T20:25:40.9578848Z 2025-05-07T20:25:40.9578851Z 2025-05-07T20:25:40.9578883Z 2025-05-07T20:25:40.9578887Z 2025-05-07T20:25:40.9578892Z 2025-05-07T20:25:40.9578896Z 2025-05-07T20:25:40.9580490Z 2025-05-07T20:25:41.0105218Z cuda-nvdisasm-12.6.7 | 47.6 MB | ## | 20%  2025-05-07T20:25:41.0471091Z nsight-compute-2024. | 443.1 MB | #########1 | 92% 2025-05-07T20:25:41.0471369Z 2025-05-07T20:25:41.0471373Z 2025-05-07T20:25:41.0471377Z 2025-05-07T20:25:41.0471380Z 2025-05-07T20:25:41.0471384Z 2025-05-07T20:25:41.0471387Z 2025-05-07T20:25:41.0472828Z 2025-05-07T20:25:41.0703547Z libnpp-12.3.1.54 | 93.4 MB | ######3 | 63%  2025-05-07T20:25:41.0703830Z 2025-05-07T20:25:41.0703834Z 2025-05-07T20:25:41.0703837Z 2025-05-07T20:25:41.0703841Z 2025-05-07T20:25:41.0703845Z 2025-05-07T20:25:41.0703849Z 2025-05-07T20:25:41.0703853Z 2025-05-07T20:25:41.0704959Z 2025-05-07T20:25:41.1362127Z cuda-nvdisasm-12.6.7 | 47.6 MB | ##7 | 27%  2025-05-07T20:25:41.1561005Z nsight-compute-2024. | 443.1 MB | #########2 | 92% 2025-05-07T20:25:41.1561416Z 2025-05-07T20:25:41.1561422Z 2025-05-07T20:25:41.1561427Z 2025-05-07T20:25:41.1561432Z 2025-05-07T20:25:41.1561438Z 2025-05-07T20:25:41.1561442Z 2025-05-07T20:25:41.1561460Z 2025-05-07T20:25:41.1918531Z libnpp-12.3.1.54 | 93.4 MB | ######6 | 66%  2025-05-07T20:25:41.1918920Z 2025-05-07T20:25:41.1918925Z 2025-05-07T20:25:41.1918931Z 2025-05-07T20:25:41.1918936Z 2025-05-07T20:25:41.1918941Z 2025-05-07T20:25:41.1918946Z 2025-05-07T20:25:41.1918952Z 2025-05-07T20:25:41.1918957Z 2025-05-07T20:25:41.2363671Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###4 | 34%  2025-05-07T20:25:41.2562695Z nsight-compute-2024. | 443.1 MB | #########2 | 93% 2025-05-07T20:25:41.2563066Z 2025-05-07T20:25:41.2563073Z 2025-05-07T20:25:41.2563078Z 2025-05-07T20:25:41.2563083Z 2025-05-07T20:25:41.2563088Z 2025-05-07T20:25:41.2563093Z 2025-05-07T20:25:41.2564673Z 2025-05-07T20:25:41.2940101Z libnpp-12.3.1.54 | 93.4 MB | ######8 | 69%  2025-05-07T20:25:41.2940519Z 2025-05-07T20:25:41.2940525Z 2025-05-07T20:25:41.2940531Z 2025-05-07T20:25:41.2940536Z 2025-05-07T20:25:41.2940541Z 2025-05-07T20:25:41.2940559Z 2025-05-07T20:25:41.2940565Z 2025-05-07T20:25:41.2940570Z 2025-05-07T20:25:41.3531873Z cuda-nvdisasm-12.6.7 | 47.6 MB | #### | 40%  2025-05-07T20:25:41.3565937Z nsight-compute-2024. 
| 443.1 MB | #########3 | 94% 2025-05-07T20:25:41.3566285Z 2025-05-07T20:25:41.3566292Z 2025-05-07T20:25:41.3566299Z 2025-05-07T20:25:41.3566310Z 2025-05-07T20:25:41.3566316Z 2025-05-07T20:25:41.3566322Z 2025-05-07T20:25:41.3568029Z 2025-05-07T20:25:41.3951024Z libnpp-12.3.1.54 | 93.4 MB | #######1 | 72%  2025-05-07T20:25:41.3951410Z 2025-05-07T20:25:41.3951414Z 2025-05-07T20:25:41.3951418Z 2025-05-07T20:25:41.3951422Z 2025-05-07T20:25:41.3951426Z 2025-05-07T20:25:41.3951429Z 2025-05-07T20:25:41.3951433Z 2025-05-07T20:25:41.3956250Z 2025-05-07T20:25:41.4568488Z cuda-nvdisasm-12.6.7 | 47.6 MB | ####6 | 47%  2025-05-07T20:25:41.4568880Z 2025-05-07T20:25:41.4568884Z 2025-05-07T20:25:41.4568888Z 2025-05-07T20:25:41.4569107Z 2025-05-07T20:25:41.4569112Z 2025-05-07T20:25:41.4569125Z 2025-05-07T20:25:41.4570967Z 2025-05-07T20:25:41.4585377Z libnpp-12.3.1.54 | 93.4 MB | #######4 | 75%  2025-05-07T20:25:41.4951558Z nsight-compute-2024. | 443.1 MB | #########4 | 94% 2025-05-07T20:25:41.4951921Z 2025-05-07T20:25:41.4951927Z 2025-05-07T20:25:41.4951932Z 2025-05-07T20:25:41.4951937Z 2025-05-07T20:25:41.4951942Z 2025-05-07T20:25:41.4951947Z 2025-05-07T20:25:41.4951952Z 2025-05-07T20:25:41.4951957Z 2025-05-07T20:25:41.5581495Z cuda-nvdisasm-12.6.7 | 47.6 MB | #####3 | 53%  2025-05-07T20:25:41.5581883Z 2025-05-07T20:25:41.5581887Z 2025-05-07T20:25:41.5581892Z 2025-05-07T20:25:41.5581896Z 2025-05-07T20:25:41.5581935Z 2025-05-07T20:25:41.5581939Z 2025-05-07T20:25:41.5583455Z 2025-05-07T20:25:41.5590985Z libnpp-12.3.1.54 | 93.4 MB | #######7 | 78%  2025-05-07T20:25:41.6097910Z nsight-compute-2024. | 443.1 MB | #########5 | 95% 2025-05-07T20:25:41.6098190Z 2025-05-07T20:25:41.6098195Z 2025-05-07T20:25:41.6098198Z 2025-05-07T20:25:41.6098202Z 2025-05-07T20:25:41.6098206Z 2025-05-07T20:25:41.6098209Z 2025-05-07T20:25:41.6098213Z 2025-05-07T20:25:41.6101044Z 2025-05-07T20:25:41.6582933Z cuda-nvdisasm-12.6.7 | 47.6 MB | #####9 | 60%  2025-05-07T20:25:41.6583343Z 2025-05-07T20:25:41.6583347Z 2025-05-07T20:25:41.6583360Z 2025-05-07T20:25:41.6583364Z 2025-05-07T20:25:41.6583367Z 2025-05-07T20:25:41.6583372Z 2025-05-07T20:25:41.6584183Z 2025-05-07T20:25:41.6654173Z libnpp-12.3.1.54 | 93.4 MB | ######## | 81%  2025-05-07T20:25:41.7098023Z nsight-compute-2024. | 443.1 MB | #########5 | 96% 2025-05-07T20:25:41.7098448Z 2025-05-07T20:25:41.7098454Z 2025-05-07T20:25:41.7098460Z 2025-05-07T20:25:41.7098465Z 2025-05-07T20:25:41.7098470Z 2025-05-07T20:25:41.7098486Z 2025-05-07T20:25:41.7098491Z 2025-05-07T20:25:41.7101106Z 2025-05-07T20:25:41.7617464Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######5 | 66%  2025-05-07T20:25:41.7617865Z 2025-05-07T20:25:41.7617880Z 2025-05-07T20:25:41.7617884Z 2025-05-07T20:25:41.7617888Z 2025-05-07T20:25:41.7617892Z 2025-05-07T20:25:41.7617895Z 2025-05-07T20:25:41.7621068Z 2025-05-07T20:25:41.7758230Z libnpp-12.3.1.54 | 93.4 MB | ########3 | 84%  2025-05-07T20:25:41.8101228Z nsight-compute-2024. 
| 443.1 MB | #########6 | 96% 2025-05-07T20:25:41.8101584Z 2025-05-07T20:25:41.8101591Z 2025-05-07T20:25:41.8101596Z 2025-05-07T20:25:41.8101601Z 2025-05-07T20:25:41.8101615Z 2025-05-07T20:25:41.8101621Z 2025-05-07T20:25:41.8101626Z 2025-05-07T20:25:41.8101631Z 2025-05-07T20:25:41.8200489Z cuda-nvdisasm-12.6.7 | 47.6 MB | #######2 | 72%  2025-05-07T20:25:41.8200926Z 2025-05-07T20:25:41.8200941Z 2025-05-07T20:25:41.8200946Z 2025-05-07T20:25:41.8200952Z 2025-05-07T20:25:41.8201716Z 2025-05-07T20:25:41.8623520Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%  2025-05-07T20:25:41.8623833Z 2025-05-07T20:25:41.8623838Z 2025-05-07T20:25:41.8623841Z 2025-05-07T20:25:41.8623845Z 2025-05-07T20:25:41.8623849Z 2025-05-07T20:25:41.8623853Z 2025-05-07T20:25:41.8633137Z 2025-05-07T20:25:41.8761348Z libnpp-12.3.1.54 | 93.4 MB | ########7 | 87%  2025-05-07T20:25:41.8883950Z nsight-compute-2024. | 443.1 MB | #########7 | 97% 2025-05-07T20:25:41.8884360Z 2025-05-07T20:25:41.8884374Z 2025-05-07T20:25:41.8884380Z 2025-05-07T20:25:41.8884385Z 2025-05-07T20:25:41.8884390Z 2025-05-07T20:25:41.8884396Z 2025-05-07T20:25:41.8884401Z 2025-05-07T20:25:41.8884406Z 2025-05-07T20:25:41.8889073Z 2025-05-07T20:25:41.9105532Z libcurand-10.3.7.77 | 39.9 MB | | 0%  2025-05-07T20:25:41.9106482Z 2025-05-07T20:25:41.9106487Z 2025-05-07T20:25:41.9106491Z 2025-05-07T20:25:41.9106495Z 2025-05-07T20:25:41.9106498Z 2025-05-07T20:25:41.9106502Z 2025-05-07T20:25:41.9106683Z 2025-05-07T20:25:41.9107386Z 2025-05-07T20:25:41.9727323Z cuda-nvdisasm-12.6.7 | 47.6 MB | #######8 | 79%  2025-05-07T20:25:41.9727716Z 2025-05-07T20:25:41.9727721Z 2025-05-07T20:25:41.9727724Z 2025-05-07T20:25:41.9727728Z 2025-05-07T20:25:41.9727732Z 2025-05-07T20:25:41.9727736Z 2025-05-07T20:25:41.9736070Z 2025-05-07T20:25:41.9868785Z libnpp-12.3.1.54 | 93.4 MB | ######### | 90%  2025-05-07T20:25:41.9887052Z nsight-compute-2024. | 443.1 MB | #########7 | 98% 2025-05-07T20:25:41.9887441Z 2025-05-07T20:25:41.9887447Z 2025-05-07T20:25:41.9887453Z 2025-05-07T20:25:41.9887459Z 2025-05-07T20:25:41.9887464Z 2025-05-07T20:25:41.9887470Z 2025-05-07T20:25:41.9887477Z 2025-05-07T20:25:41.9887483Z 2025-05-07T20:25:41.9889422Z 2025-05-07T20:25:42.0208523Z libcurand-10.3.7.77 | 39.9 MB | 5 | 6%  2025-05-07T20:25:42.0208834Z 2025-05-07T20:25:42.0208839Z 2025-05-07T20:25:42.0208843Z 2025-05-07T20:25:42.0208864Z 2025-05-07T20:25:42.0208867Z 2025-05-07T20:25:42.0208871Z 2025-05-07T20:25:42.0208875Z 2025-05-07T20:25:42.0208878Z 2025-05-07T20:25:42.0847661Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########5 | 85%  2025-05-07T20:25:42.0848097Z 2025-05-07T20:25:42.0848103Z 2025-05-07T20:25:42.0848108Z 2025-05-07T20:25:42.0848113Z 2025-05-07T20:25:42.0848118Z 2025-05-07T20:25:42.0848123Z 2025-05-07T20:25:42.0849726Z 2025-05-07T20:25:42.0977434Z libnpp-12.3.1.54 | 93.4 MB | #########3 | 93%  2025-05-07T20:25:42.0977800Z 2025-05-07T20:25:42.0977806Z 2025-05-07T20:25:42.0977810Z 2025-05-07T20:25:42.0977823Z 2025-05-07T20:25:42.0977826Z 2025-05-07T20:25:42.0977831Z 2025-05-07T20:25:42.0977835Z 2025-05-07T20:25:42.0977873Z 2025-05-07T20:25:42.0982667Z 2025-05-07T20:25:42.0986131Z libcurand-10.3.7.77 | 39.9 MB | #1 | 12%  2025-05-07T20:25:42.1326382Z nsight-compute-2024. 
| 443.1 MB | #########8 | 98% 2025-05-07T20:25:42.1326788Z 2025-05-07T20:25:42.1326795Z 2025-05-07T20:25:42.1326800Z 2025-05-07T20:25:42.1326816Z 2025-05-07T20:25:42.1326822Z 2025-05-07T20:25:42.1326827Z 2025-05-07T20:25:42.1326832Z 2025-05-07T20:25:42.1330627Z 2025-05-07T20:25:42.1849893Z cuda-nvdisasm-12.6.7 | 47.6 MB | #########1 | 91%  2025-05-07T20:25:42.1850216Z 2025-05-07T20:25:42.1850221Z 2025-05-07T20:25:42.1850224Z 2025-05-07T20:25:42.1850229Z 2025-05-07T20:25:42.1850232Z 2025-05-07T20:25:42.1850236Z 2025-05-07T20:25:42.1851594Z 2025-05-07T20:25:42.1987896Z libnpp-12.3.1.54 | 93.4 MB | #########6 | 96%  2025-05-07T20:25:42.1988288Z 2025-05-07T20:25:42.1988292Z 2025-05-07T20:25:42.1988296Z 2025-05-07T20:25:42.1988309Z 2025-05-07T20:25:42.1988343Z 2025-05-07T20:25:42.1988346Z 2025-05-07T20:25:42.1988351Z 2025-05-07T20:25:42.1988355Z 2025-05-07T20:25:42.1990349Z 2025-05-07T20:25:42.2146042Z libcurand-10.3.7.77 | 39.9 MB | #7 | 18%  2025-05-07T20:25:42.2589163Z nsight-compute-2024. | 443.1 MB | #########8 | 99% 2025-05-07T20:25:42.2589440Z 2025-05-07T20:25:42.2589444Z 2025-05-07T20:25:42.2589448Z 2025-05-07T20:25:42.2589459Z 2025-05-07T20:25:42.2589463Z 2025-05-07T20:25:42.2589467Z 2025-05-07T20:25:42.2589470Z 2025-05-07T20:25:42.2591242Z 2025-05-07T20:25:42.2892681Z cuda-nvdisasm-12.6.7 | 47.6 MB | #########7 | 97%  2025-05-07T20:25:42.2893000Z 2025-05-07T20:25:42.2893004Z 2025-05-07T20:25:42.2893008Z 2025-05-07T20:25:42.2893013Z 2025-05-07T20:25:42.2893016Z 2025-05-07T20:25:42.2893021Z 2025-05-07T20:25:42.2895642Z 2025-05-07T20:25:42.2989621Z libnpp-12.3.1.54 | 93.4 MB | #########9 | 99%  2025-05-07T20:25:42.2990186Z 2025-05-07T20:25:42.2990191Z 2025-05-07T20:25:42.2990194Z 2025-05-07T20:25:42.2990198Z 2025-05-07T20:25:42.2990202Z 2025-05-07T20:25:42.2990206Z 2025-05-07T20:25:42.2990209Z 2025-05-07T20:25:42.2990213Z 2025-05-07T20:25:42.2992190Z 2025-05-07T20:25:42.3312845Z libcurand-10.3.7.77 | 39.9 MB | ##3 | 24%  2025-05-07T20:25:42.3991754Z nsight-compute-2024. 
2025-05-07T20:25:43.1016550Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%
2025-05-07T20:25:44.0078792Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########## | 100%
2025-05-07T20:25:44.4167744Z libcublas-12.6.4.1 | 256.2 MB | ########## | 100%
2025-05-07T20:25:44.4507753Z libcurand-10.3.7.77 | 39.9 MB | ########## | 100%
2025-05-07T20:25:45.6925757Z libnpp-12.3.1.54 | 93.4 MB | ########## | 100%
2025-05-07T20:25:46.0031453Z cuda-nvcc-tools-12.6 | 23.0 MB | ########## | 100%
2025-05-07T20:25:46.5144589Z python-3.11.8 | 29.3 MB | ########## | 100%
2025-05-07T20:25:46.6798841Z gds-tools-1.11.1.6 | 37.8 MB | ########## | 100%
2025-05-07T20:25:46.8932369Z cuda-nvrtc-12.6.85 | 17.3 MB | ########## | 100%
2025-05-07T20:25:47.1326530Z libnvjitlink-12.6.85 | 14.9 MB | ########## | 100%
2025-05-07T20:25:47.1382536Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%
2025-05-07T20:25:47.2384415Z cuda-nvcc-dev_linux- | 10.8 MB | ########## | 100%
2025-05-07T20:25:47.3389913Z ... (more hidden) ...
2025-05-07T20:25:47.5332153Z cuda-nvvm-tools-12.6 | 10.4 MB | ########## | 100%
2025-05-07T20:25:47.5403869Z cuda-sanitizer-api-1 | 8.9 MB | ########## | 100%
2025-05-07T20:25:47.8882455Z cuda-nvvm-impl-12.6. | 7.7 MB | ########## | 100%
2025-05-07T20:25:48.4604005Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%
2025-05-07T20:25:49.5941908Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%
2025-05-07T20:25:50.3071646Z nsight-compute-2024. | 443.1 MB | ########## | 100%
2025-05-07T20:25:58.2660162Z 2025-05-07T20:25:58.2660322Z  2025-05-07T20:25:58.2660496Z 2025-05-07T20:25:58.2660502Z 2025-05-07T20:25:58.2660508Z 2025-05-07T20:25:58.2660682Z  2025-05-07T20:25:58.2660860Z 2025-05-07T20:25:58.2660865Z 2025-05-07T20:25:58.2660871Z 2025-05-07T20:25:58.2660877Z 2025-05-07T20:25:58.2661055Z  2025-05-07T20:25:58.2661236Z 2025-05-07T20:25:58.2661241Z 2025-05-07T20:25:58.2661255Z 2025-05-07T20:25:58.2661270Z 2025-05-07T20:25:58.2661276Z 2025-05-07T20:25:58.2661463Z  2025-05-07T20:25:58.2661655Z 2025-05-07T20:25:58.2661660Z 2025-05-07T20:25:58.2661665Z 2025-05-07T20:25:58.2661676Z 2025-05-07T20:25:58.2661682Z 2025-05-07T20:25:58.2661687Z 2025-05-07T20:25:58.2661884Z  2025-05-07T20:25:58.2662086Z 2025-05-07T20:25:58.2662092Z 2025-05-07T20:25:58.2662097Z 2025-05-07T20:25:58.2662103Z 2025-05-07T20:25:58.2662108Z 2025-05-07T20:25:58.2662113Z 2025-05-07T20:25:58.2662118Z 2025-05-07T20:25:58.2662321Z  2025-05-07T20:25:58.2662533Z 2025-05-07T20:25:58.2662539Z 2025-05-07T20:25:58.2662544Z 2025-05-07T20:25:58.2662550Z 2025-05-07T20:25:58.2662555Z 2025-05-07T20:25:58.2662561Z 2025-05-07T20:25:58.2662567Z 2025-05-07T20:25:58.2662572Z 2025-05-07T20:25:58.2662785Z  2025-05-07T20:25:58.2663017Z 2025-05-07T20:25:58.2663023Z 2025-05-07T20:25:58.2663028Z 2025-05-07T20:25:58.2663033Z 2025-05-07T20:25:58.2663172Z 2025-05-07T20:25:58.2663177Z 2025-05-07T20:25:58.2663183Z 2025-05-07T20:25:58.2663188Z 2025-05-07T20:25:58.2663205Z 2025-05-07T20:25:58.2663413Z  2025-05-07T20:25:58.2663778Z 2025-05-07T20:25:58.2663786Z 2025-05-07T20:25:58.2663791Z 2025-05-07T20:25:58.2663797Z 2025-05-07T20:25:58.2663802Z 2025-05-07T20:25:58.2663807Z 2025-05-07T20:25:58.2663813Z 2025-05-07T20:25:58.2663829Z 2025-05-07T20:25:58.2663835Z 2025-05-07T20:25:58.2663841Z 2025-05-07T20:25:58.2664064Z  2025-05-07T20:25:58.2664315Z 2025-05-07T20:25:58.2664321Z 2025-05-07T20:25:58.2664327Z 2025-05-07T20:25:58.2664332Z 2025-05-07T20:25:58.2664350Z 2025-05-07T20:25:58.2664356Z 2025-05-07T20:25:58.2664362Z 2025-05-07T20:25:58.2664367Z 2025-05-07T20:25:58.2664373Z 2025-05-07T20:25:58.2664379Z 2025-05-07T20:25:58.2664385Z 2025-05-07T20:25:58.2664597Z  2025-05-07T20:25:58.2664890Z 2025-05-07T20:25:58.2664895Z 2025-05-07T20:25:58.2664910Z 2025-05-07T20:25:58.2664915Z 2025-05-07T20:25:58.2664921Z 2025-05-07T20:25:58.2664926Z 2025-05-07T20:25:58.2664931Z 2025-05-07T20:25:58.2664936Z 2025-05-07T20:25:58.2664942Z 2025-05-07T20:25:58.2664956Z 2025-05-07T20:25:58.2664963Z 2025-05-07T20:25:58.2664968Z 2025-05-07T20:25:58.2665193Z  2025-05-07T20:25:58.2665488Z 2025-05-07T20:25:58.2665494Z 2025-05-07T20:25:58.2665500Z 2025-05-07T20:25:58.2665506Z 2025-05-07T20:25:58.2665512Z 2025-05-07T20:25:58.2665517Z 2025-05-07T20:25:58.2665523Z 2025-05-07T20:25:58.2665529Z 2025-05-07T20:25:58.2665535Z 2025-05-07T20:25:58.2665540Z 2025-05-07T20:25:58.2665546Z 2025-05-07T20:25:58.2665552Z 2025-05-07T20:25:58.2665557Z 2025-05-07T20:25:58.2665782Z  2025-05-07T20:25:58.2666087Z 2025-05-07T20:25:58.2666094Z 2025-05-07T20:25:58.2666099Z 2025-05-07T20:25:58.2666104Z 2025-05-07T20:25:58.2666110Z 2025-05-07T20:25:58.2666116Z 2025-05-07T20:25:58.2666130Z 2025-05-07T20:25:58.2666135Z 2025-05-07T20:25:58.2666141Z 2025-05-07T20:25:58.2666147Z 2025-05-07T20:25:58.2666152Z 2025-05-07T20:25:58.2666158Z 2025-05-07T20:25:58.2666164Z 2025-05-07T20:25:58.2666169Z 2025-05-07T20:25:58.2666417Z  2025-05-07T20:25:58.2666730Z 2025-05-07T20:25:58.2666736Z 2025-05-07T20:25:58.2666741Z 2025-05-07T20:25:58.2666746Z 2025-05-07T20:25:58.2666752Z 2025-05-07T20:25:58.2666758Z 
2025-05-07T20:25:58.2666763Z 2025-05-07T20:25:58.2666778Z 2025-05-07T20:25:58.2666783Z 2025-05-07T20:25:58.2666789Z 2025-05-07T20:25:58.2666794Z 2025-05-07T20:25:58.2666800Z 2025-05-07T20:25:58.2666805Z 2025-05-07T20:25:58.2666811Z 2025-05-07T20:25:58.2666816Z 2025-05-07T20:25:58.2667066Z  2025-05-07T20:25:58.2667398Z 2025-05-07T20:25:58.2667404Z 2025-05-07T20:25:58.2667410Z 2025-05-07T20:25:58.2667416Z 2025-05-07T20:25:58.2667422Z 2025-05-07T20:25:58.2667427Z 2025-05-07T20:25:58.2667432Z 2025-05-07T20:25:58.2667447Z 2025-05-07T20:25:58.2667453Z 2025-05-07T20:25:58.2667458Z 2025-05-07T20:25:58.2667463Z 2025-05-07T20:25:58.2667468Z 2025-05-07T20:25:58.2667473Z 2025-05-07T20:25:58.2667477Z 2025-05-07T20:25:58.2667487Z 2025-05-07T20:25:58.2667494Z 2025-05-07T20:25:58.2667759Z  2025-05-07T20:25:58.2668086Z 2025-05-07T20:25:58.2668092Z 2025-05-07T20:25:58.2668097Z 2025-05-07T20:25:58.2668103Z 2025-05-07T20:25:58.2668108Z 2025-05-07T20:25:58.2668112Z 2025-05-07T20:25:58.2668117Z 2025-05-07T20:25:58.2668123Z 2025-05-07T20:25:58.2668128Z 2025-05-07T20:25:58.2668133Z 2025-05-07T20:25:58.2668138Z 2025-05-07T20:25:58.2668143Z 2025-05-07T20:25:58.2668148Z 2025-05-07T20:25:58.2668153Z 2025-05-07T20:25:58.2668169Z 2025-05-07T20:25:58.2668175Z 2025-05-07T20:25:58.2668180Z 2025-05-07T20:25:58.2668446Z  2025-05-07T20:25:58.2668785Z 2025-05-07T20:25:58.2668790Z 2025-05-07T20:25:58.2668795Z 2025-05-07T20:25:58.2668932Z 2025-05-07T20:25:58.2668938Z 2025-05-07T20:25:58.2669000Z 2025-05-07T20:25:58.2669006Z 2025-05-07T20:25:58.2669011Z 2025-05-07T20:25:58.2669017Z 2025-05-07T20:25:58.2669023Z 2025-05-07T20:25:58.2669140Z 2025-05-07T20:25:58.2669148Z 2025-05-07T20:25:58.2669153Z 2025-05-07T20:25:58.2669158Z 2025-05-07T20:25:58.2669163Z 2025-05-07T20:25:58.2669179Z 2025-05-07T20:25:58.2669185Z 2025-05-07T20:25:58.2669190Z 2025-05-07T20:25:58.2669486Z  2025-05-07T20:25:58.2669825Z 2025-05-07T20:25:58.2669830Z 2025-05-07T20:25:58.2670004Z  2025-05-07T20:25:58.2670170Z 2025-05-07T20:25:58.2670176Z 2025-05-07T20:25:58.2670339Z  2025-05-07T20:25:58.2670517Z 2025-05-07T20:25:58.2670523Z 2025-05-07T20:25:58.2670529Z 2025-05-07T20:25:58.2670711Z  2025-05-07T20:25:58.2670909Z 2025-05-07T20:25:58.2670916Z 2025-05-07T20:25:58.2670923Z 2025-05-07T20:25:58.2670929Z 2025-05-07T20:25:58.2671124Z  2025-05-07T20:25:58.2671303Z 2025-05-07T20:25:58.2671317Z 2025-05-07T20:25:58.2671323Z 2025-05-07T20:25:58.2671328Z 2025-05-07T20:25:58.2671334Z 2025-05-07T20:25:58.2671499Z  2025-05-07T20:25:58.2671686Z 2025-05-07T20:25:58.2671699Z 2025-05-07T20:25:58.2671705Z 2025-05-07T20:25:58.2671710Z 2025-05-07T20:25:58.2671727Z 2025-05-07T20:25:58.2671732Z 2025-05-07T20:25:58.2671902Z  2025-05-07T20:25:58.2672102Z 2025-05-07T20:25:58.2672108Z 2025-05-07T20:25:58.2672114Z 2025-05-07T20:25:58.2672119Z 2025-05-07T20:25:58.2672125Z 2025-05-07T20:25:58.2672131Z 2025-05-07T20:25:58.2672144Z 2025-05-07T20:25:58.2672319Z  2025-05-07T20:25:58.2672535Z 2025-05-07T20:25:58.2672540Z 2025-05-07T20:25:58.2672546Z 2025-05-07T20:25:58.2672552Z 2025-05-07T20:25:58.2672557Z 2025-05-07T20:25:58.2672563Z 2025-05-07T20:25:58.2672577Z 2025-05-07T20:25:58.2672583Z 2025-05-07T20:25:58.2672768Z  2025-05-07T20:25:58.2673005Z 2025-05-07T20:25:58.2673016Z 2025-05-07T20:25:58.2673022Z 2025-05-07T20:25:58.2673028Z 2025-05-07T20:25:58.2673034Z 2025-05-07T20:25:58.2673039Z 2025-05-07T20:25:58.2673052Z 2025-05-07T20:25:58.2673057Z 2025-05-07T20:25:58.2673063Z 2025-05-07T20:25:58.2673266Z  2025-05-07T20:25:58.2673517Z 2025-05-07T20:25:58.2673523Z 2025-05-07T20:25:58.2673529Z 
2025-05-07T20:25:58.2673535Z 2025-05-07T20:25:58.2673549Z 2025-05-07T20:25:58.2673555Z 2025-05-07T20:25:58.2673561Z 2025-05-07T20:25:58.2673567Z 2025-05-07T20:25:58.2673573Z 2025-05-07T20:25:58.2673578Z 2025-05-07T20:25:58.2673779Z  2025-05-07T20:25:58.2674045Z 2025-05-07T20:25:58.2674050Z 2025-05-07T20:25:58.2674066Z 2025-05-07T20:25:58.2674071Z 2025-05-07T20:25:58.2674076Z 2025-05-07T20:25:58.2674081Z 2025-05-07T20:25:58.2674086Z 2025-05-07T20:25:58.2674091Z 2025-05-07T20:25:58.2674096Z 2025-05-07T20:25:58.2674102Z 2025-05-07T20:25:58.2674107Z 2025-05-07T20:25:58.2674302Z  2025-05-07T20:25:58.2674578Z 2025-05-07T20:25:58.2674584Z 2025-05-07T20:25:58.2674589Z 2025-05-07T20:25:58.2674594Z 2025-05-07T20:25:58.2674600Z 2025-05-07T20:25:58.2674606Z 2025-05-07T20:25:58.2674611Z 2025-05-07T20:25:58.2674620Z 2025-05-07T20:25:58.2674625Z 2025-05-07T20:25:58.2674631Z 2025-05-07T20:25:58.2674636Z 2025-05-07T20:25:58.2674641Z 2025-05-07T20:25:58.2674827Z  2025-05-07T20:25:58.2675107Z 2025-05-07T20:25:58.2675113Z 2025-05-07T20:25:58.2675118Z 2025-05-07T20:25:58.2675123Z 2025-05-07T20:25:58.2675128Z 2025-05-07T20:25:58.2675133Z 2025-05-07T20:25:58.2675139Z 2025-05-07T20:25:58.2675144Z 2025-05-07T20:25:58.2675149Z 2025-05-07T20:25:58.2675154Z 2025-05-07T20:25:58.2675159Z 2025-05-07T20:25:58.2675164Z 2025-05-07T20:25:58.2675169Z 2025-05-07T20:25:58.2675389Z  2025-05-07T20:25:58.2675655Z 2025-05-07T20:25:58.2675660Z 2025-05-07T20:25:58.2675666Z 2025-05-07T20:25:58.2675671Z 2025-05-07T20:25:58.2675838Z 2025-05-07T20:25:58.2675843Z 2025-05-07T20:25:58.2675848Z 2025-05-07T20:25:58.2675853Z 2025-05-07T20:25:58.2675858Z 2025-05-07T20:25:58.2675863Z 2025-05-07T20:25:58.2675868Z 2025-05-07T20:25:58.2675873Z 2025-05-07T20:25:58.2675967Z 2025-05-07T20:25:58.2675973Z 2025-05-07T20:25:58.2676181Z  2025-05-07T20:25:58.2676452Z 2025-05-07T20:25:58.2676457Z 2025-05-07T20:25:58.2676462Z 2025-05-07T20:25:58.2676467Z 2025-05-07T20:25:58.2676472Z 2025-05-07T20:25:58.2676488Z 2025-05-07T20:25:58.2676493Z 2025-05-07T20:25:58.2676498Z 2025-05-07T20:25:58.2676503Z 2025-05-07T20:25:58.2676508Z 2025-05-07T20:25:58.2676513Z 2025-05-07T20:25:58.2676518Z 2025-05-07T20:25:58.2676523Z 2025-05-07T20:25:58.2676528Z 2025-05-07T20:25:58.2676533Z 2025-05-07T20:25:58.2676756Z  2025-05-07T20:25:58.2677044Z 2025-05-07T20:25:58.2677050Z 2025-05-07T20:25:58.2677055Z 2025-05-07T20:25:58.2677060Z 2025-05-07T20:25:58.2677074Z 2025-05-07T20:25:58.2677079Z 2025-05-07T20:25:58.2677084Z 2025-05-07T20:25:58.2677089Z 2025-05-07T20:25:58.2677094Z 2025-05-07T20:25:58.2677099Z 2025-05-07T20:25:58.2677104Z 2025-05-07T20:25:58.2677109Z 2025-05-07T20:25:58.2677124Z 2025-05-07T20:25:58.2677129Z 2025-05-07T20:25:58.2677134Z 2025-05-07T20:25:58.2677139Z 2025-05-07T20:25:58.2677374Z  2025-05-07T20:25:58.2677679Z 2025-05-07T20:25:58.2677685Z 2025-05-07T20:25:58.2677690Z 2025-05-07T20:25:58.2677695Z 2025-05-07T20:25:58.2677701Z 2025-05-07T20:25:58.2677706Z 2025-05-07T20:25:58.2677711Z 2025-05-07T20:25:58.2677716Z 2025-05-07T20:25:58.2677721Z 2025-05-07T20:25:58.2677726Z 2025-05-07T20:25:58.2677731Z 2025-05-07T20:25:58.2677736Z 2025-05-07T20:25:58.2677741Z 2025-05-07T20:25:58.2677756Z 2025-05-07T20:25:58.2677761Z 2025-05-07T20:25:58.2677766Z 2025-05-07T20:25:58.2677771Z 2025-05-07T20:25:58.2678006Z  2025-05-07T20:25:58.2678324Z 2025-05-07T20:25:58.2678329Z 2025-05-07T20:25:58.2678334Z 2025-05-07T20:25:58.2678347Z 2025-05-07T20:25:58.2678352Z 2025-05-07T20:25:58.2678357Z 2025-05-07T20:25:58.2678362Z 2025-05-07T20:25:58.2678367Z 2025-05-07T20:25:58.2678379Z 
2025-05-07T20:25:58.2683633Z done
2025-05-07T20:25:58.5885837Z Preparing transaction: done
2025-05-07T20:26:00.1662668Z Verifying transaction: done
2025-05-07T20:26:00.9837837Z Executing transaction: done
2025-05-07T20:26:03.3345211Z [INSTALL] Fixing file placements for CUDA 12.6.3+ ...
2025-05-07T20:26:03.3345624Z [INSTALL] Creating symlinks: libnvToolsExt.so
2025-05-07T20:26:03.3346313Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:03.3360294Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:03.3373334Z [INSTALL] Copying nvtx3 headers ...
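
[NOTE] (editor) Recent CUDA conda packages ship only the versioned libnvToolsExt.so.1, while build tooling may still link against the unversioned name, so the step above recreates the .so linker name in both library locations before the header copies that follow. A minimal sketch of the same fix-up, assuming the env root is held in a hypothetical PREFIX variable:

    # Recreate the unversioned linker name next to each versioned copy of the library.
    # PREFIX is illustrative, e.g. /home/ec2-user/miniconda/envs/build_binary
    for libdir in "$PREFIX/lib" "$PREFIX/targets/x86_64-linux/lib"; do
      if [ -f "$libdir/libnvToolsExt.so.1" ]; then
        ln -sf "$libdir/libnvToolsExt.so.1" "$libdir/libnvToolsExt.so"
      fi
    done
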
2025-05-07T20:26:03.3378531Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/
2025-05-07T20:26:03.5142508Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/
2025-05-07T20:26:03.5161818Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
2025-05-07T20:26:03.5527955Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:05.4336774Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:05.4975668Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:05.9329823Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:05.9840434Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:06.4207050Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
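
[NOTE] (editor) The printenv failure above is expected: LD_LIBRARY_PATH was simply unset before this step, so "appending" reduces to setting it. `conda env config vars set` persists the variable in the env's own metadata, and conda then exports it on every later `conda activate` or `conda run`; the stubs directory it points at holds link-time stand-ins for driver libraries such as libcuda.so and libnvidia-ml.so. A minimal sketch of the persist-then-verify pattern, with an illustrative variable name:

    # Persist a variable into the env, then read it back through a fresh `conda run`.
    conda env config vars set -n build_binary MY_VAR=/some/path   # MY_VAR is hypothetical
    conda run -n build_binary printenv MY_VAR                     # prints /some/path
    conda env config vars list -n build_binary                    # shows all persisted vars
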
2025-05-07T20:26:06.4208466Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:08.8708690Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:10.8982724Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:12.9095576Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:12.9096395Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:14.9486719Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:16.8318103Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:16.8932374Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:26:20.7468721Z /tmp/tmp_mbt3l3p: line 3: clang: command not found
2025-05-07T20:26:20.7469583Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:26:20.8103153Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:26:20.8122381Z total 36
2025-05-07T20:26:20.8122915Z drwxr-xr-x. 2 ec2-user ec2-user   191 May  7 20:26 .
2025-05-07T20:26:20.8123315Z drwxr-xr-x. 5 ec2-user ec2-user    62 May  7 20:24 ..
2025-05-07T20:26:20.8123754Z -rw-r--r--. 2 ec2-user ec2-user  3778 Jun 10  2024 activate-binutils_linux-64.sh
2025-05-07T20:26:20.8124258Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10  2024 activate-gcc_linux-64.sh
2025-05-07T20:26:20.8124761Z -rw-r--r--. 2 ec2-user ec2-user  5190 Jun 10  2024 activate-gxx_linux-64.sh
2025-05-07T20:26:20.8125249Z -rw-r--r--. 2 ec2-user ec2-user   136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:26:20.8125875Z -rw-r--r--. 2 ec2-user ec2-user   872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:26:20.8126485Z -rw-r--r--. 2 ec2-user ec2-user  2932 Nov 20 20:32 ~cuda-nvcc_activate.sh
2025-05-07T20:26:20.8127060Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
2025-05-07T20:26:20.8127815Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:26:20.8148817Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:26:22.7776573Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:26:22.7777278Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:26:23.2118892Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:26:25.0995546Z -allow-unsupported-compiler
2025-05-07T20:26:25.1631364Z [INFO] Printing out all preprocessor defines in nvcc ...
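
[NOTE] (editor) This job requests clang as the host compiler, but the packaged ~cuda-nvcc_activate.sh hook appears to pin nvcc's host compiler through a -ccbin= line pointing at the conda g++; deleting that line and persisting NVCC_PREPEND_FLAGS=-allow-unsupported-compiler lets nvcc accept an unpinned host compiler without tripping its version check. The dump that follows prints every macro predefined by the nvcc preprocessing pipeline, a standard way to confirm which toolchain and CUDA API versions the build will actually see. A minimal sketch of the same probe, assuming nvcc is on PATH:

    # Dump all predefined macros for an empty CUDA translation unit and
    # pick out a few toolchain markers from the (very long) output.
    nvcc --compiler-options -dM -E -x cu - < /dev/null \
      | grep -E '__CUDACC_VER_(MAJOR|MINOR)__|__CUDA_API_VER_MAJOR__'
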
2025-05-07T20:26:25.1632698Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null 2025-05-07T20:26:25.1633379Z 2025-05-07T20:26:27.1189895Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead"))) 2025-05-07T20:26:27.1190688Z #define M_PIl 3.141592653589793238462643383279502884L 2025-05-07T20:26:27.1191127Z #define _IO_CURRENTLY_PUTTING 0x800 2025-05-07T20:26:27.1191515Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig)) 2025-05-07T20:26:27.1191849Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:26:27.1192107Z #define _STL_PAIR_H 1 2025-05-07T20:26:27.1192386Z #define __cpp_attributes 200809L 2025-05-07T20:26:27.1192712Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:26:27.1193054Z #define __DELETE_THROW throw() 2025-05-07T20:26:27.1193317Z #define _PTRDIFF_T_ 2025-05-07T20:26:27.1193557Z #define M_PI_4 0.78539816339744830962 2025-05-07T20:26:27.1193843Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:26:27.1194109Z #define _IO_LEFT 02 2025-05-07T20:26:27.1194334Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:26:27.1194590Z #define _POSIX2_BC_SCALE_MAX 99 2025-05-07T20:26:27.1194857Z #define _GLIBCXX_USE_RANDOM_TR1 1 2025-05-07T20:26:27.1195278Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp) 2025-05-07T20:26:27.1195706Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:26:27.1196066Z #define RE_DUP_MAX (0x7fff) 2025-05-07T20:26:27.1196409Z #define _IOS_OUTPUT 2 2025-05-07T20:26:27.1196891Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:26:27.1197285Z #define toascii_l(c,l) __toascii_l ((c), (l)) 2025-05-07T20:26:27.1197596Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:26:27.1197870Z #define _GLIBCXX_USE_FCHMOD 1 2025-05-07T20:26:27.1198144Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:26:27.1198918Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; })) 2025-05-07T20:26:27.1200130Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:26:27.1200555Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:26:27.1200959Z #define cudaTextureTypeCubemapLayered 0xFC 2025-05-07T20:26:27.1201290Z #define _T_WCHAR_ 2025-05-07T20:26:27.1201522Z #define stdout stdout 2025-05-07T20:26:27.1201863Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11"))) 2025-05-07T20:26:27.1202254Z #define CHAR_BIT __CHAR_BIT__ 2025-05-07T20:26:27.1202512Z #define __flexarr [] 2025-05-07T20:26:27.1202760Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:26:27.1203098Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:26:27.1203445Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:26:27.1203718Z #define _MATH_H 1 2025-05-07T20:26:27.1204009Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:26:27.1204350Z #define __S64_TYPE long int 2025-05-07T20:26:27.1204614Z #define __stub_fchflags 2025-05-07T20:26:27.1204886Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:26:27.1205182Z #define __SQUAD_TYPE long int 2025-05-07T20:26:27.1205549Z #define __INTMAX_C(c) c ## L 2025-05-07T20:26:27.1206169Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:26:27.1206548Z #define NL_NMAX INT_MAX 2025-05-07T20:26:27.1206878Z #define _BITS_TIME_H 1 2025-05-07T20:26:27.1207249Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:26:27.1207821Z #define 
_GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:26:27.1208242Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:26:27.1208728Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:26:27.1209707Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:26:27.1210210Z #define __CHAR_BIT__ 8 2025-05-07T20:26:27.1210745Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:27.1211201Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:26:27.1211608Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:26:27.1211882Z #define FP_NAN 0 2025-05-07T20:26:27.1212152Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:26:27.1212595Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:26:27.1213089Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:26:27.1213481Z #define __cudaCDP2GetErrorString 2025-05-07T20:26:27.1213774Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:26:27.1214039Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:26:27.1214302Z #define __SM_80_RT_H__ 2025-05-07T20:26:27.1214537Z #define _NEW 2025-05-07T20:26:27.1214769Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:26:27.1215065Z #define __UINT8_MAX__ 0xff 2025-05-07T20:26:27.1215445Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:26:27.1215856Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:26:27.1216108Z #define __USE_ANSI 1 2025-05-07T20:26:27.1216403Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:26:27.1216805Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:26:27.1217169Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:26:27.1217479Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:26:27.1217768Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:26:27.1218052Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:26:27.1218333Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:26:27.1218621Z #define PIPE_BUF 4096 2025-05-07T20:26:27.1218945Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:26:27.1219309Z #define ADJ_TICK 0x4000 2025-05-07T20:26:27.1219598Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:26:27.1219915Z #define MQ_PRIO_MAX 32768 2025-05-07T20:26:27.1220188Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:26:27.1220523Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:26:27.1220993Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:27.1221525Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:26:27.1221895Z #define _XOPEN_SOURCE 700 2025-05-07T20:26:27.1222157Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:26:27.1222431Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:26:27.1222721Z #define __cpp_static_assert 201411L 2025-05-07T20:26:27.1223065Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:26:27.1223410Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:26:27.1223692Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:26:27.1223983Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:26:27.1224287Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:26:27.1224577Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:26:27.1224885Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:27.1225250Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:26:27.1225595Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:26:27.1225885Z #define 
_GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:26:27.1226210Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:27.1226571Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:26:27.1226937Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:26:27.1227238Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:26:27.1227535Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:26:27.1227871Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:26:27.1228205Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:26:27.1228611Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:26:27.1229037Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:26:27.1229467Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:26:27.1229744Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:26:27.1230030Z #define __GCC_IEC_559 2 2025-05-07T20:26:27.1230413Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:26:27.1230760Z #define _IO_flockfile(_fp) 2025-05-07T20:26:27.1231022Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:26:27.1231304Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:26:27.1231578Z #define _IOFBF 0 2025-05-07T20:26:27.1231792Z #define __USE_BSD 1 2025-05-07T20:26:27.1232030Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:26:27.1232309Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:26:27.1232583Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:26:27.1232845Z #define _IO_NO_WRITES 8 2025-05-07T20:26:27.1233112Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:26:27.1233461Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:26:27.1233823Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:26:27.1234143Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:26:27.1234474Z #define __cpp_binary_literals 201304L 2025-05-07T20:26:27.1234767Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:26:27.1235048Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:26:27.1235327Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:26:27.1235642Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:26:27.1236038Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:26:27.1236413Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:26:27.1236726Z #define M_PI 3.14159265358979323846 2025-05-07T20:26:27.1237044Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:26:27.1237382Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:26:27.1237692Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:26:27.1238000Z #define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:26:27.1238284Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:26:27.1238561Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:26:27.1239152Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? 
EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:26:27.1239806Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:26:27.1240143Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:26:27.1240473Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:26:27.1240784Z #define __cudaCDP2GetErrorName 2025-05-07T20:26:27.1241068Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:26:27.1241338Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:26:27.1241652Z #define __ASSERT_VOID_CAST static_cast 2025-05-07T20:26:27.1241993Z #define __cpp_variadic_templates 200704L 2025-05-07T20:26:27.1242299Z #define RAND_MAX 2147483647 2025-05-07T20:26:27.1242571Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:26:27.1242908Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:27.1243229Z #define __SM_90_RT_H__ 2025-05-07T20:26:27.1243475Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:26:27.1243748Z #define __COMPAR_FN_T 2025-05-07T20:26:27.1243994Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:26:27.1244257Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:26:27.1244738Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:26:27.1245257Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:26:27.1245604Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:26:27.1245966Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:26:27.1246286Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:26:27.1246632Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:26:27.1258216Z #define __cpp_variable_templates 201304L 2025-05-07T20:26:27.1258762Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:27.1259326Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:26:27.1259670Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:26:27.1260165Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:26:27.1260476Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:26:27.1260782Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:26:27.1261158Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:26:27.1261439Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:26:27.1261705Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:26:27.1261962Z #define __u_char_defined 2025-05-07T20:26:27.1262293Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:26:27.1262657Z #define STA_PPSERROR 0x0800 2025-05-07T20:26:27.1262927Z #define _GLIBCXX_STD_A std 2025-05-07T20:26:27.1263191Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:26:27.1263483Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:26:27.1263928Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:26:27.1264365Z #define FP_INFINITE 1 2025-05-07T20:26:27.1264743Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:27.1265172Z #define _IO_pid_t __pid_t 2025-05-07T20:26:27.1265439Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:26:27.1265713Z #define __LEAF , __leaf__ 2025-05-07T20:26:27.1265964Z #define PATH_MAX 4096 2025-05-07T20:26:27.1266235Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:26:27.1266586Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:26:27.1266913Z #define _LIMITS_H___ 2025-05-07T20:26:27.1267152Z #define __size_t 2025-05-07T20:26:27.1267394Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:26:27.1267957Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | 
STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:26:27.1268526Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:26:27.1268846Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:26:27.1269190Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:26:27.1269454Z #define _WCHAR_T_DEFINED 2025-05-07T20:26:27.1269826Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:26:27.1270241Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:26:27.1270541Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:26:27.1270875Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:26:27.1271174Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:26:27.1271470Z #define __INT8_C(c) c 2025-05-07T20:26:27.1271738Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:26:27.1272051Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:26:27.1272327Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:26:27.1272587Z #define __SM_70_RT_HPP__ 2025-05-07T20:26:27.1272844Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:26:27.1273125Z #define __cpp_variadic_using 201611L 2025-05-07T20:26:27.1273447Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:27.1273787Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:26:27.1274072Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:26:27.1274344Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:26:27.1274616Z #define __cpp_capture_star_this 201603L 2025-05-07T20:26:27.1274943Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:26:27.1275255Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:26:27.1275618Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:26:27.1276014Z #define NFDBITS __NFDBITS 2025-05-07T20:26:27.1276283Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:26:27.1276575Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:26:27.1276902Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:26:27.1277227Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:26:27.1277486Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:26:27.1277783Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:26:27.1278094Z #define STA_UNSYNC 0x0040 2025-05-07T20:26:27.1278406Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:27.1278837Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:26:27.1279209Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:26:27.1279507Z #define __cpp_if_constexpr 201606L 2025-05-07T20:26:27.1279957Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:26:27.1280337Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:26:27.1280688Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:26:27.1281132Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:26:27.1281478Z #define __daddr_t_defined 2025-05-07T20:26:27.1281740Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:26:27.1282020Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:26:27.1282349Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:26:27.1282873Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:26:27.1283370Z #define _ACRTIMP 2025-05-07T20:26:27.1283600Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:26:27.1283877Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:26:27.1284177Z #define _IOS_BIN 128 2025-05-07T20:26:27.1284535Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:26:27.1284965Z #define __FLT64X_HAS_QUIET_NAN__ 1 
2025-05-07T20:26:27.1285242Z #define UNDERFLOW 4 2025-05-07T20:26:27.1285464Z #define NAME_MAX 255 2025-05-07T20:26:27.1285712Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:26:27.1285989Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:26:27.1286271Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:26:27.1286576Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:26:27.1286960Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:26:27.1287350Z #define __ptr_t void * 2025-05-07T20:26:27.1287695Z #define M_E 2.7182818284590452354 2025-05-07T20:26:27.1287979Z #define cudaSurfaceType1D 0x01 2025-05-07T20:26:27.1288251Z #define __USE_ISOCXX11 1 2025-05-07T20:26:27.1288521Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:26:27.1288839Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:26:27.1289138Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:26:27.1289409Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:26:27.1289708Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:26:27.1290024Z #define cudaSurfaceType2D 0x02 2025-05-07T20:26:27.1290281Z #define __linux 1 2025-05-07T20:26:27.1290517Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:26:27.1290793Z #define cudaDeviceMask 0xff 2025-05-07T20:26:27.1291057Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:26:27.1291354Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:26:27.1291637Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:26:27.1291920Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:26:27.1292234Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:26:27.1292543Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:26:27.1292836Z #define _BITS_TYPES_H 1 2025-05-07T20:26:27.1293124Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:26:27.1293460Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:26:27.1293759Z #define cudaSurfaceType3D 0x03 2025-05-07T20:26:27.1294034Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:26:27.1294329Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:26:27.1294623Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:26:27.1295415Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:26:27.1296246Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:26:27.1296539Z #define __unix 1 2025-05-07T20:26:27.1296759Z #define MATH_ERRNO 1 2025-05-07T20:26:27.1297001Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:26:27.1297282Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:26:27.1297556Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:26:27.1297835Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:26:27.1298120Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:26:27.1298410Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:26:27.1298868Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:26:27.1299441Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:26:27.1299742Z #define CUDARTAPI_CDECL 2025-05-07T20:26:27.1299993Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:26:27.1300348Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:26:27.1300637Z #define __cpp_lib_void_t 201411 2025-05-07T20:26:27.1300899Z #define _POSIX_AIO_MAX 1 2025-05-07T20:26:27.1301131Z #define __SIZE_T 2025-05-07T20:26:27.1301384Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:26:27.1301707Z #define _GLIBCXX_FULLY_DYNAMIC_STRING 
0 2025-05-07T20:26:27.1301999Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:26:27.1302261Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:26:27.1302520Z #define _ATFILE_SOURCE 1 2025-05-07T20:26:27.1302905Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:26:27.1303334Z #define __WAIT_STATUS void * 2025-05-07T20:26:27.1303596Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:26:27.1303858Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:26:27.1304132Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:26:27.1304417Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:26:27.1304686Z #define __WINT_MIN__ 0U 2025-05-07T20:26:27.1305282Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:26:27.1306354Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:26:27.1306655Z #define WUNTRACED 2 2025-05-07T20:26:27.1306882Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:26:27.1307159Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:26:27.1307443Z #define NZERO 20 2025-05-07T20:26:27.1307667Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:26:27.1307954Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:26:27.1308256Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:26:27.1308546Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:26:27.1308808Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:26:27.1309105Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:26:27.1309432Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:26:27.1309729Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:26:27.1310008Z #define EXIT_FAILURE 1 2025-05-07T20:26:27.1310258Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:26:27.1310524Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:26:27.1310798Z #define _SIZE_T_DEFINED_ 2025-05-07T20:26:27.1311057Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:26:27.1311338Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:26:27.1311685Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:26:27.1312052Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:26:27.1312347Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:26:27.1312611Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:26:27.1312891Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:26:27.1313190Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:26:27.1313504Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:26:27.1313804Z #define SEEK_DATA 3 2025-05-07T20:26:27.1314043Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:26:27.1314345Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:26:27.1314782Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:26:27.1315182Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:26:27.1315433Z #define __INT64_C(c) c ## L 2025-05-07T20:26:27.1315727Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:26:27.1316058Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:26:27.1316388Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:26:27.1316669Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:26:27.1316970Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:26:27.1317287Z #define STA_PPSWANDER 0x0400 2025-05-07T20:26:27.1317551Z #define __INT_WCHAR_T_H 2025-05-07T20:26:27.1317792Z #define WSTOPPED 2 2025-05-07T20:26:27.1318040Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:26:27.1318331Z #define _POSIX_MQ_OPEN_MAX 8 2025-05-07T20:26:27.1318579Z #define FP_NORMAL 4 
[predefined-macro dump emitted during the CUDA build (host GCC headers plus NVCC device-pass macros); the lines identifying the toolchain are retained below in log order, the remaining #define lines (glibc, libstdc++, Linux limits, CUDA runtime constants) are elided]
2025-05-07T20:26:27.1328945Z #define __GLIBC__ 2
2025-05-07T20:26:27.1364120Z #define __CUDACC_VER_MINOR__ 6
2025-05-07T20:26:27.1369397Z #define __gnu_linux__ 1
2025-05-07T20:26:27.1393299Z #define __GNUC__ 11
2025-05-07T20:26:27.1435432Z #define __cplusplus 201703L
2025-05-07T20:26:27.1455623Z #define __GLIBCXX__ 20230528
2025-05-07T20:26:27.1503851Z #define __CUDACC_VER_BUILD__ 85
2025-05-07T20:26:27.1530471Z #define __VERSION__ "11.4.0"
2025-05-07T20:26:27.1569268Z #define __CUDACC_VER_MAJOR__ 12
2025-05-07T20:26:27.1570090Z #define __CUDA_ARCH__ 520
2025-05-07T20:26:27.1582034Z #define __x86_64__ 1
[... dump continues ...]
2025-05-07T20:26:27.1605387Z #define assert(expr) ((expr) ?
__ASSERT_VOID_CAST (0) : __assert_fail (__STRING(expr), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:27.1605486Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:26:27.1605588Z #define le64toh(x) (x) 2025-05-07T20:26:27.1606585Z #define FILENAME_MAX 4096 2025-05-07T20:26:27.1606754Z #define __iscntrl_l(c,l) __isctype_l((c), _IScntrl, (l)) 2025-05-07T20:26:27.1606891Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:26:27.1606983Z #define L_cuserid 9 2025-05-07T20:26:27.1607077Z #define __ino_t_defined 2025-05-07T20:26:27.1607171Z #define __k8__ 1 2025-05-07T20:26:27.1607275Z #define __INTPTR_TYPE__ long int 2025-05-07T20:26:27.1607390Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:26:27.1607489Z #define __int8_t_defined 2025-05-07T20:26:27.1607669Z #define __WCHAR_TYPE__ int 2025-05-07T20:26:27.1607781Z #define __CLOCKID_T_TYPE __S32_TYPE 2025-05-07T20:26:27.1607901Z #define cudaHostRegisterPortable 0x01 2025-05-07T20:26:27.1608006Z #define __SLONGWORD_TYPE long int 2025-05-07T20:26:27.1608103Z #define _IOS_TRUNC 16 2025-05-07T20:26:27.1608227Z #define _GLIBCXX_PACKAGE_TARNAME "libstdc++" 2025-05-07T20:26:27.1608383Z #define __isblank_l(c,l) __isctype_l((c), _ISblank, (l)) 2025-05-07T20:26:27.1608488Z #define __HAVE_COLUMN 2025-05-07T20:26:27.1608581Z #define __stub_fdetach 2025-05-07T20:26:27.1608997Z #define __CUDACC_VER__ "__CUDACC_VER__ is no longer supported. Use __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, and __CUDACC_VER_BUILD__ instead." 2025-05-07T20:26:27.1609097Z #define __pic__ 2 2025-05-07T20:26:27.1609224Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:27.1609334Z #define CLOCKS_PER_SEC 1000000l 2025-05-07T20:26:27.1609434Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:26:27.1609541Z #define _GLIBCXX_HAVE_SOCKATMARK 1 2025-05-07T20:26:27.1609645Z #define __stub_chflags 2025-05-07T20:26:27.1609742Z #define CLOCK_BOOTTIME 7 2025-05-07T20:26:27.1609834Z #define __need_IOV_MAX 2025-05-07T20:26:27.1609956Z #define putc(_ch,_fp) _IO_putc (_ch, _fp) 2025-05-07T20:26:27.1610066Z #define __UQUAD_TYPE unsigned long int 2025-05-07T20:26:27.1610171Z #define __cpp_decltype 200707L 2025-05-07T20:26:27.1610281Z #define __BYTE_ORDER __LITTLE_ENDIAN 2025-05-07T20:26:27.1610383Z #define _GLIBCXX_USE_C99 1 2025-05-07T20:26:27.1610495Z #define _GLIBCXX_TR1_BETA_FUNCTION_TCC 1 2025-05-07T20:26:27.1610595Z #define TTY_NAME_MAX 32 2025-05-07T20:26:27.1610775Z #define _GLIBCXX_FORWARD(_Tp,__val) std::forward<_Tp>(__val) 2025-05-07T20:26:27.1610911Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:27.1611085Z #define _PSTL_ASSERT(_Condition) __glibcxx_assert(_Condition) 2025-05-07T20:26:27.1611204Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:26:27.1611311Z #define __LITTLE_ENDIAN 1234 2025-05-07T20:26:27.1611412Z #define STA_PPSTIME 0x0004 2025-05-07T20:26:27.1611503Z #define __import__ 2025-05-07T20:26:27.1611607Z #define BUFSIZ _IO_BUFSIZ 2025-05-07T20:26:27.1611750Z #define M_SQRT2l 1.414213562373095048801688724209698079L 2025-05-07T20:26:27.1611841Z #define __export__ 2025-05-07T20:26:27.1611976Z #define __FSID_T_TYPE struct { int __val[2]; } 2025-05-07T20:26:27.1612084Z #define cudaMemAttachHost 0x02 2025-05-07T20:26:27.1612493Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:26:27.1612604Z #define _GLIBCXX_HAVE_ICONV 1 2025-05-07T20:26:27.1612700Z #define _GLIBCXX_SYMVER 1 2025-05-07T20:26:27.1612925Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:26:27.1613026Z #define _WCHAR_T_DECLARED 
2025-05-07T20:26:27.1613153Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:26:27.1613284Z #define isalpha_l(c,l) __isalpha_l ((c), (l)) 2025-05-07T20:26:27.1613399Z #define __cpp_inline_variables 201606L 2025-05-07T20:26:27.1613497Z #define WNOWAIT 0x01000000 2025-05-07T20:26:27.1613594Z #define PLOSS 6 2025-05-07T20:26:27.1613697Z #define M_LN10 2.30258509299404568402 2025-05-07T20:26:27.1613964Z #define _PSTL_UDS_PRESENT (__INTEL_COMPILER >= 1900 && __INTEL_COMPILER_BUILD_DATE >= 20180626) 2025-05-07T20:26:27.1614065Z #define EXIT_SUCCESS 0 2025-05-07T20:26:27.1614169Z #define __LDBL_REDIR_DECL(name) 2025-05-07T20:26:27.1614281Z #define _GLIBCXX_HAVE_STRTOF 1 2025-05-07T20:26:27.1614396Z #define MOD_FREQUENCY ADJ_FREQUENCY 2025-05-07T20:26:27.1614495Z #define __thread__ __thread 2025-05-07T20:26:27.1614605Z #define _GLIBCXX_HAVE_MEMORY_H 1 2025-05-07T20:26:27.1614710Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:26:27.1614821Z #define __SIZEOF_PTHREAD_BARRIER_T 32 2025-05-07T20:26:27.1615062Z #define __glibcxx_requires_partitioned_upper_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:26:27.1615186Z #define __cudaCDP2StreamWaitEvent_ptsz 2025-05-07T20:26:27.1615289Z #define _GLIBCXX_HAVE_SINF 1 2025-05-07T20:26:27.1615385Z #define __linux__ 1 2025-05-07T20:26:27.1615488Z #define STA_PPSSIGNAL 0x0100 2025-05-07T20:26:27.1615623Z #define M_LN2l 0.693147180559945309417232121458176568L 2025-05-07T20:26:27.1615729Z #define __S16_TYPE short int 2025-05-07T20:26:27.1616081Z #define __glibcxx_constexpr_assert(cond) if (__builtin_is_constant_evaluated() && !bool(cond)) __builtin_unreachable() 2025-05-07T20:26:27.1616203Z #define __NVCC_DIAG_PRAGMA_SUPPORT__ 1 2025-05-07T20:26:27.1616404Z #define __bos(ptr) __builtin_object_size (ptr, __USE_FORTIFY_LEVEL > 1) 2025-05-07T20:26:27.1616509Z #define __COMMON_FUNCTIONS_H__ 2025-05-07T20:26:27.1616621Z #define UINT_MAX (INT_MAX * 2U + 1U) 2025-05-07T20:26:27.1616716Z #define _T_SIZE_ 2025-05-07T20:26:27.1616821Z #define LLONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:27.1616953Z #define __cudaCDP2StreamCreateWithFlags 2025-05-07T20:26:27.1617054Z #define _PSTL_VERSION 12000 2025-05-07T20:26:27.1617183Z #define __noinline__ __attribute__((noinline)) 2025-05-07T20:26:27.1617292Z #define __WNOTHREAD 0x20000000 2025-05-07T20:26:27.1617397Z #define _G_va_list __gnuc_va_list 2025-05-07T20:26:27.1617540Z #define M_PI_4l 0.785398163397448309615660845819875721L 2025-05-07T20:26:27.1617633Z #define _IOS_INPUT 1 2025-05-07T20:26:27.1617732Z #define __USE_LARGEFILE64 1 2025-05-07T20:26:27.1617850Z #define _GLIBCXX_TR1_EXP_INTEGRAL_TCC 1 2025-05-07T20:26:27.1617951Z #define __INT64_TYPE__ long int 2025-05-07T20:26:27.1618053Z #define _POSIX_SSIZE_MAX 32767 2025-05-07T20:26:27.1618170Z #define __shared__ __location__(shared) 2025-05-07T20:26:27.1618269Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:26:27.1618430Z #define __glibc_unlikely(cond) __builtin_expect((cond), 0) 2025-05-07T20:26:27.1618537Z #define __gid_t_defined 2025-05-07T20:26:27.1618656Z #define _GLIBCXX_USE_SC_NPROCESSORS_ONLN 1 2025-05-07T20:26:27.1618766Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:26:27.1618971Z #define __glibcxx_requires_can_increment_range(_First1,_Last1,_First2) 2025-05-07T20:26:27.1619077Z #define _GLIBCXX17_INLINE inline 2025-05-07T20:26:27.1619180Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:26:27.1619275Z #define ___int_size_t_h 2025-05-07T20:26:27.1619390Z #define __FSBLKCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:27.1619550Z #define __cpp_inheriting_constructors 
201511L 2025-05-07T20:26:27.1619740Z #define __WIFCONTINUED(status) ((status) == __W_CONTINUED) 2025-05-07T20:26:27.1619849Z #define CUDA_DOUBLE_MATH_FUNCTIONS 1 2025-05-07T20:26:27.1619957Z #define _GLIBCXX_HAVE_FENV_H 1 2025-05-07T20:26:27.1620149Z #define _GLIBCXX_HAVE_STDBOOL_H 1 2025-05-07T20:26:27.1620250Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:26:27.1620387Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:27.1620578Z #define _GLIBCXX_TR1_HYPERGEOMETRIC_TCC 1 2025-05-07T20:26:27.1620717Z #define _GLIBCXX_DEBUG_PEDASSERT(_Condition) 2025-05-07T20:26:27.1620815Z #define __clock_t_defined 1 2025-05-07T20:26:27.1620922Z #define _POSIX_SEM_VALUE_MAX 32767 2025-05-07T20:26:27.1621045Z #define __cudaCDP2RuntimeGetVersion 2025-05-07T20:26:27.1621143Z #define __GLIBC_MINOR__ 17 2025-05-07T20:26:27.1621243Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:26:27.1621354Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:26:27.1621470Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:26:27.1621569Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:26:27.1621753Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:27.1621843Z #define __SSE__ 1 2025-05-07T20:26:27.1621961Z #define SEM_VALUE_MAX (2147483647) 2025-05-07T20:26:27.1622068Z #define M_SQRT1_2 0.70710678118654752440 2025-05-07T20:26:27.1622158Z #define _CTYPE_H 1 2025-05-07T20:26:27.1622261Z #define __sigset_t_defined 2025-05-07T20:26:27.1622368Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:26:27.1622470Z #define _GLIBCXX_HAVE_LOGF 1 2025-05-07T20:26:27.1622569Z #define MOD_TAI ADJ_TAI 2025-05-07T20:26:27.1622673Z #define _IO_va_list __gnuc_va_list 2025-05-07T20:26:27.1622774Z #define _GLIBCXX_HAVE_LOGL 1 2025-05-07T20:26:27.1622870Z #define __SM_70_RT_H__ 2025-05-07T20:26:27.1622972Z #define _GLIBCXX_HAVE_WRITEV 1 2025-05-07T20:26:27.1623085Z #define cudaEventWaitDefault 0x00 2025-05-07T20:26:27.1623192Z #define _GLIBCXX_HAVE_EXPL 1 2025-05-07T20:26:27.1623359Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:27.1623467Z #define _POSIX_MAX_CANON 255 2025-05-07T20:26:27.1623583Z #define _GLIBCXX_NOEXCEPT_PARM , bool _NE 2025-05-07T20:26:27.1623685Z #define FD_SETSIZE __FD_SETSIZE 2025-05-07T20:26:27.1623792Z #define _GLIBCXX_TXN_SAFE 2025-05-07T20:26:27.1623881Z #define __amd64__ 1 2025-05-07T20:26:27.1623976Z #define __WINT_WIDTH__ 32 2025-05-07T20:26:27.1624096Z #define __CUDA_DEVICE_RUNTIME_API_H__ 2025-05-07T20:26:27.1624367Z #define __REDIRECT_NTHNL(name,proto,alias) name proto __THROWNL __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:27.1624474Z #define _GLIBCXX_STDIO_SEEK_CUR 1 2025-05-07T20:26:27.1624583Z #define EOF (-1) 2025-05-07T20:26:27.1624687Z #define __WAIT_STATUS_DEFN void * 2025-05-07T20:26:27.1624786Z #define __USE_POSIX199309 1 2025-05-07T20:26:27.1624897Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:26:27.1624997Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:26:27.1625106Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:26:27.1625211Z #define LLONG_MIN (-LLONG_MAX-1) 2025-05-07T20:26:27.1625331Z #define cudaSurfaceType2DLayered 0xF2 2025-05-07T20:26:27.1625441Z #define ____mbstate_t_defined 1 2025-05-07T20:26:27.1625537Z #define STA_NANO 0x2000 2025-05-07T20:26:27.1625647Z #define _GLIBCXX_HAVE_LOG10F 1 2025-05-07T20:26:27.1625757Z #define _GLIBCXX_HAVE_LOG10L 1 2025-05-07T20:26:27.1625851Z #define _IO_LINKED 0x80 2025-05-07T20:26:27.1625958Z #define __cpp_lib_launder 201606 2025-05-07T20:26:27.1626065Z #define 
__SIZEOF_INT128__ 16 2025-05-07T20:26:27.1626173Z #define __PTHREAD_MUTEX_HAVE_PREV 1 2025-05-07T20:26:27.1626276Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:26:27.1626385Z #define _GLIBCXX_TYPE_TRAITS 1 2025-05-07T20:26:27.1626535Z #define cudaGraphKernelNodePortProgrammatic 1 2025-05-07T20:26:27.1626656Z #define __DEVICE_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:27.1626765Z #define __BLKCNT64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:27.1626867Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:26:27.1626975Z #define __W_CONTINUED 0xffff 2025-05-07T20:26:27.1627075Z #define __ATOMIC_RELAXED 0 2025-05-07T20:26:27.1627214Z #define w_coredump __wait_terminated.__w_coredump 2025-05-07T20:26:27.1627351Z #define __FSBLKCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:27.1627646Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessor 2025-05-07T20:26:27.1627839Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:26:27.1628044Z #define __stub_stty 2025-05-07T20:26:27.1628225Z #define _tolower(c) ((int) (*__ctype_tolower_loc ())[(int) (c)]) 2025-05-07T20:26:27.1628326Z #define le16toh(x) (x) 2025-05-07T20:26:27.1628439Z #define BC_SCALE_MAX _POSIX2_BC_SCALE_MAX 2025-05-07T20:26:27.1641361Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:26:27.1641498Z #define _SIZET_ 2025-05-07T20:26:27.1641602Z #define XATTR_NAME_MAX 255 2025-05-07T20:26:27.1641694Z #define _SVID_SOURCE 1 2025-05-07T20:26:27.1641781Z #define _LP64 1 2025-05-07T20:26:27.1641879Z #define _LIBC_LIMITS_H_ 1 2025-05-07T20:26:27.1642132Z #define __REDIRECT_NTH_LDBL(name,proto,alias) __REDIRECT_NTH (name, proto, alias) 2025-05-07T20:26:27.1642249Z #define _GLIBCXX_TR1_BESSEL_FUNCTION_TCC 1 2025-05-07T20:26:27.1642356Z #define __UINT8_C(c) c 2025-05-07T20:26:27.1642458Z #define _GLIBCXX_HAVE_CEILF 1 2025-05-07T20:26:27.1642558Z #define _GLIBCXX_HAVE_CEILL 1 2025-05-07T20:26:27.1642672Z #define __cudaCDP2Memset3DAsync_ptsz 2025-05-07T20:26:27.1642775Z #define __CUDA_ARCH_LIST__ 520 2025-05-07T20:26:27.1642878Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:26:27.1642981Z #define MOD_MAXERROR ADJ_MAXERROR 2025-05-07T20:26:27.1643070Z #define CUDARTAPI 2025-05-07T20:26:27.1643165Z #define IOV_MAX 1024 2025-05-07T20:26:27.1643315Z #define __glibcxx_requires_irreflexive2(_First,_Last) 2025-05-07T20:26:27.1643418Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:26:27.1643529Z #define cudaMemAttachSingle 0x04 2025-05-07T20:26:27.1643616Z #define __wchar_t__ 2025-05-07T20:26:27.1643723Z #define __cpp_lib_is_aggregate 201703 2025-05-07T20:26:27.1643813Z #define SEEK_END 2 2025-05-07T20:26:27.1643910Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:26:27.1644092Z #define _GLIBCXX_USE_TBB_PAR_BACKEND __has_include() 2025-05-07T20:26:27.1644198Z #define _IO_ftrylockfile(_fp) 2025-05-07T20:26:27.1644345Z #define _GLIBCXX_USE_C99_WCHAR _GLIBCXX11_USE_C99_WCHAR 2025-05-07T20:26:27.1644444Z #define ____FILE_defined 1 2025-05-07T20:26:27.1644568Z #define _GLIBCXX_HAVE_BUILTIN_IS_AGGREGATE 1 2025-05-07T20:26:27.1644667Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:26:27.1644763Z #define _ISOC99_SOURCE 1 2025-05-07T20:26:27.1644863Z #define __VECTOR_FUNCTIONS_H__ 2025-05-07T20:26:27.1645116Z #define __REDIRECT_NTH(name,proto,alias) name proto __THROW __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:27.1645258Z #define _PSTL_USE_NONTEMPORAL_STORES_IF_ALLOWED 2025-05-07T20:26:27.1645347Z #define _IO_RIGHT 04 2025-05-07T20:26:27.1645451Z #define __END_NAMESPACE_STD 2025-05-07T20:26:27.1645643Z 
#define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:27.1645740Z #define _GLIBCXX_STD_C std 2025-05-07T20:26:27.1645868Z #define cudaInitDeviceFlagsAreValid 0x01 2025-05-07T20:26:27.1645967Z #define _LARGEFILE64_SOURCE 1 2025-05-07T20:26:27.1646075Z #define _GLIBCXX_USE_C99_STDINT_TR1 1 2025-05-07T20:26:27.1646167Z #define _STDDEF_H_ 2025-05-07T20:26:27.1646344Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:27.1646451Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:26:27.1646578Z #define isalnum_l(c,l) __isalnum_l ((c), (l)) 2025-05-07T20:26:27.1646784Z #define __FD_ISSET(d,set) ((__FDS_BITS (set)[__FD_ELT (d)] & __FD_MASK (d)) != 0) 2025-05-07T20:26:27.1646907Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:27.1647052Z #define __glibcxx_requires_irreflexive(_First,_Last) 2025-05-07T20:26:27.1647179Z #define cudaGraphKernelNodePortDefault 0 2025-05-07T20:26:27.1647291Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:26:27.1647404Z #define __cudaCDP2Memcpy3DAsync_ptsz 2025-05-07T20:26:27.1647503Z #define __PID_T_TYPE __S32_TYPE 2025-05-07T20:26:27.1647700Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:26:27.1647802Z #define CHARCLASS_NAME_MAX 2048 2025-05-07T20:26:27.1648069Z #define _GLIBCXX_HAVE_TANF 1 2025-05-07T20:26:27.1648177Z #define _GLIBCXX_USE_ST_MTIM 1 2025-05-07T20:26:27.1648354Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:26:27.1648542Z #define __CUDA_RUNTIME_H__ 2025-05-07T20:26:27.1648725Z #define WIFSIGNALED(status) __WIFSIGNALED (__WAIT_INT (status)) 2025-05-07T20:26:27.1648827Z #define _GLIBCXX_HAVE_STDLIB_H 1 2025-05-07T20:26:27.1648933Z #define __STDCPP_THREADS__ 1 2025-05-07T20:26:27.1649080Z #define M_2_SQRTPIl 1.128379167095512573896158903121545172L 2025-05-07T20:26:27.1649194Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:26:27.1649310Z #define _POSIX_UIO_MAXIOV 16 2025-05-07T20:26:27.1649437Z #define _PSTL_PAR_BACKEND_SERIAL 2025-05-07T20:26:27.1649538Z #define P_tmpdir "/tmp" 2025-05-07T20:26:27.1649669Z #define __ASSERT_FUNCTION __PRETTY_FUNCTION__ 2025-05-07T20:26:27.1649767Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:26:27.1649870Z #define __WORDSIZE_TIME64_COMPAT32 1 2025-05-07T20:26:27.1650057Z #define _GLIBCXX_DEPRECATED __attribute__ ((__deprecated__)) 2025-05-07T20:26:27.1650229Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:26:27.1650345Z #define _PSTL_HIDE_FROM_ABI_PUSH 2025-05-07T20:26:27.1650471Z #define cudaStreamLegacy ((cudaStream_t)0x1) 2025-05-07T20:26:27.1650587Z #define _IO_cleanup_region_start(_fct,_fp) 2025-05-07T20:26:27.1650697Z #define __location__(a) __annotate__(a) 2025-05-07T20:26:27.1650927Z #define __device_builtin_surface_type__ __location__(device_builtin_surface_type) 2025-05-07T20:26:27.1651029Z #define _POSIX2_BC_BASE_MAX 99 2025-05-07T20:26:27.1651151Z #define __cudaCDP2DeviceGetAttribute 2025-05-07T20:26:27.1651250Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:26:27.1651344Z #define __STDC_UTF_32__ 1 2025-05-07T20:26:27.1651447Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:26:27.1651549Z #define NAN (__builtin_nanf ("")) 2025-05-07T20:26:27.1651654Z #define _POSIX_MQ_PRIO_MAX 32 2025-05-07T20:26:27.1651746Z #define __FXSR__ 1 2025-05-07T20:26:27.1651832Z #define _SIZE_T 2025-05-07T20:26:27.1651944Z #define _GLIBCXX_USE_GETTIMEOFDAY 1 2025-05-07T20:26:27.1652059Z #define cudaHostRegisterReadOnly 0x08 2025-05-07T20:26:27.1652236Z #define 
__FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:26:27.1652394Z #define __WIFSTOPPED(status) (((status) & 0xff) == 0x7f) 2025-05-07T20:26:27.1652491Z #define _IO_ssize_t __ssize_t 2025-05-07T20:26:27.1652595Z #define __ULONG32_TYPE unsigned int 2025-05-07T20:26:27.1652789Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:27.1652991Z #define cudaStreamGraphTailLaunch (cudaStream_t)0x0100000000000000 2025-05-07T20:26:27.1653092Z #define _GXX_NULLPTR_T 2025-05-07T20:26:27.1653218Z #define __glibcxx_class_requires3(_a,_b,_c,_d) 2025-05-07T20:26:27.1653309Z #define FOPEN_MAX 16 2025-05-07T20:26:27.1653409Z #define __BIG_ENDIAN 4321 2025-05-07T20:26:27.1653530Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:27.1653636Z #define __suseconds_t_defined 2025-05-07T20:26:27.1653735Z #define __off_t_defined 2025-05-07T20:26:27.1653824Z #define stderr stderr 2025-05-07T20:26:27.1653931Z #define M_LOG10E 0.43429448190325182765 2025-05-07T20:26:27.1654054Z #define __glibcxx_requires_string(_String) 2025-05-07T20:26:27.1654155Z #define _GLIBCXX_HAVE_LDEXPL 1 2025-05-07T20:26:27.1654249Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:26:27.1654667Z #define _PSTL_CPP14_2RANGE_MISMATCH_EQUAL_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201300L || __cpp_lib_robust_nonmodifying_seq_ops == 201304) 2025-05-07T20:26:27.1654762Z #define __mode_t_defined 2025-05-07T20:26:27.1654859Z #define _GCC_SIZE_T 2025-05-07T20:26:27.1654961Z #define __INO64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:27.1655066Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:26:27.1655183Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:26:27.1655282Z #define __USE_XOPEN2K8XSI 1 2025-05-07T20:26:27.1655377Z #define __UINT32_C(c) c ## U 2025-05-07T20:26:27.1655655Z #define __cpp_alias_templates 200704L 2025-05-07T20:26:27.1655764Z #define cudaHostAllocMapped 0x02 2025-05-07T20:26:27.1655872Z #define __DEVICE_LAUNCH_PARAMETERS_H__ 2025-05-07T20:26:27.1656068Z #define _STL_ITERATOR_H 1 2025-05-07T20:26:27.1656155Z #define __size_t__ 2025-05-07T20:26:27.1656297Z #define cudaStreamAttrID cudaLaunchAttributeID 2025-05-07T20:26:27.1656395Z #define _GLIBCXX_HAVE_ATANF 1 2025-05-07T20:26:27.1656508Z #define cudaEventRecordExternal 0x01 2025-05-07T20:26:27.1656668Z #define __isspace_l(c,l) __isctype_l((c), _ISspace, (l)) 2025-05-07T20:26:27.1656769Z #define _IO_BUFSIZ _G_BUFSIZ 2025-05-07T20:26:27.1656941Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:26:27.1657038Z #define _ENDIAN_H 1 2025-05-07T20:26:27.1657147Z #define __builtin_align__(a) __align__(a) 2025-05-07T20:26:27.1657248Z #define _GLIBCXX20_CONSTEXPR 2025-05-07T20:26:27.1657357Z #define __NV_NO_HOST_COMPILER_CHECK 1 2025-05-07T20:26:27.1657441Z #define __try try 2025-05-07T20:26:27.1657557Z #define _GLIBCXX_HAVE_FINITE 1 2025-05-07T20:26:27.1657654Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:26:27.1657746Z #define __INT8_MAX__ 0x7f 2025-05-07T20:26:27.1658016Z #define cudaStreamGetCaptureInfo __CUDART_API_PTSZ(cudaStreamGetCaptureInfo_v2) 2025-05-07T20:26:27.1658109Z #define __LONG_WIDTH__ 64 2025-05-07T20:26:27.1658195Z #define __PIC__ 2 2025-05-07T20:26:27.1658315Z #define BC_STRING_MAX _POSIX2_BC_STRING_MAX 2025-05-07T20:26:27.1658437Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:26:27.1658571Z #define FD_ISSET(fd,fdsetp) __FD_ISSET (fd, fdsetp) 2025-05-07T20:26:27.1658675Z #define _GLIBCXX_HAVE_FLOAT_H 1 2025-05-07T20:26:27.1658772Z #define 
_GLIBCXX_HAVE_ATANL 1 2025-05-07T20:26:27.1658958Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:26:27.1659066Z #define __DEVICE_FUNCTIONS_HPP__ 2025-05-07T20:26:27.1659171Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:26:27.1659268Z #define _IO_uid_t __uid_t 2025-05-07T20:26:27.1659373Z #define _GLIBCXX_HAVE_READLINK 1 2025-05-07T20:26:27.1659504Z #define __cudaCDP2EventRecordWithFlags_ptsz 2025-05-07T20:26:27.1659607Z #define _CONCEPT_CHECK_H 1 2025-05-07T20:26:27.1659763Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:26:27.1659868Z #define _GLIBCXX_HAVE_NETINET_IN_H 1 2025-05-07T20:26:27.1660000Z #define _GLIBCXX_TR1_SPECIAL_FUNCTION_UTIL_H 1 2025-05-07T20:26:27.1660087Z #define LONG_BIT 64 2025-05-07T20:26:27.1660201Z #define __SIZEOF_PTHREAD_BARRIERATTR_T 4 2025-05-07T20:26:27.1660308Z #define _GLIBCXX_USE_ALLOCATOR_NEW 1 2025-05-07T20:26:27.1660437Z #define __cpp_lib_math_special_functions 201603L 2025-05-07T20:26:27.1660546Z #define __fsfilcnt_t_defined 2025-05-07T20:26:27.1660640Z #define __blkcnt_t_defined 2025-05-07T20:26:27.1660911Z #define cudaKernelNodeAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:27.1661012Z #define __USE_LARGEFILE 1 2025-05-07T20:26:27.1661115Z #define __cpp_constexpr 201603L 2025-05-07T20:26:27.1661217Z #define CUDART_VERSION 12060 2025-05-07T20:26:27.1661318Z #define NL_TEXTMAX INT_MAX 2025-05-07T20:26:27.1661423Z #define cudaDeviceMapHost 0x08 2025-05-07T20:26:27.1661519Z #define _GLIBCXX_CMATH 1 2025-05-07T20:26:27.1661725Z #define __attribute_format_arg__(x) __attribute__ ((__format_arg__ (x))) 2025-05-07T20:26:27.1661819Z #define __lldiv_t_defined 1 2025-05-07T20:26:27.1661905Z #define __SSE2__ 1 2025-05-07T20:26:27.1661995Z #define _IOLBF 1 2025-05-07T20:26:27.1662099Z #define _GLIBCXX_HAVE_SYS_TYPES_H 1 2025-05-07T20:26:27.1662205Z #define _GLIBCXX_HAVE_FLOORF 1 2025-05-07T20:26:27.1662312Z #define __cpp_deduction_guides 201703L 2025-05-07T20:26:27.1662410Z #define _GLIBCXX_HAVE_EXPF 1 2025-05-07T20:26:27.1662528Z #define __annotate__(a) __attribute__((a)) 2025-05-07T20:26:27.1662622Z #define __INT32_TYPE__ int 2025-05-07T20:26:27.1662717Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:26:27.1662834Z #define cudaDeviceSyncMemops 0x80 2025-05-07T20:26:27.1663021Z #define __cpp_exceptions 199711L 2025-05-07T20:26:27.1663125Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:26:27.1663248Z #define cudaDeviceScheduleYield 0x02 2025-05-07T20:26:27.1663346Z #define _SYS_SYSMACROS_H 1 2025-05-07T20:26:27.1663569Z #define _GLIBCXX_TR1_LEGENDRE_FUNCTION_TCC 1 2025-05-07T20:26:27.1663743Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:26:27.1663846Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:26:27.1663954Z #define __SWORD_TYPE long int 2025-05-07T20:26:27.1664056Z #define __INTMAX_TYPE__ long int 2025-05-07T20:26:27.1664160Z #define _GLIBCXX11_USE_C99_MATH 1 2025-05-07T20:26:27.1664266Z #define __PTHREAD_SPINS 0, 0 2025-05-07T20:26:27.1664366Z #define _BITS_POSIX1_LIM_H 1 2025-05-07T20:26:27.1664654Z #define cudaStreamAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:27.1664763Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:26:27.1664915Z #define math_errhandling (MATH_ERRNO | MATH_ERREXCEPT) 2025-05-07T20:26:27.1665009Z #define _T_SIZE 2025-05-07T20:26:27.1665129Z #define cudaHostAllocDefault 0x00 2025-05-07T20:26:27.1665262Z #define _PSTL_PRAGMA_SIMD_EXCLUSIVE_SCAN(PRM) 
2025-05-07T20:26:27.1665404Z #define __va_arg_pack() __builtin_va_arg_pack () 2025-05-07T20:26:27.1665504Z #define _POSIX_TIMER_MAX 32 2025-05-07T20:26:27.1665603Z #define _GLIBCXX_HAVE_TLS 1 2025-05-07T20:26:27.1665737Z #define _GLIBCXX_NOTHROW _GLIBCXX_USE_NOEXCEPT 2025-05-07T20:26:27.1665838Z #define _GLIBCXX_HAVE_ACOSL 1 2025-05-07T20:26:27.1665945Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:26:27.1666049Z #define __ATOMIC_CONSUME 1 2025-05-07T20:26:27.1666229Z #define __CUDA_ARCH_HAS_FEATURE__(_FEAT) __CUDA_ARCH_FEAT_ ##_FEAT 2025-05-07T20:26:27.1666326Z #define __GNUC_MINOR__ 4 2025-05-07T20:26:27.1666443Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:26:27.1666545Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:26:27.1666671Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:27.1666766Z #define __PIE__ 2 2025-05-07T20:26:27.1666881Z #define LITTLE_ENDIAN __LITTLE_ENDIAN 2025-05-07T20:26:27.1666993Z #define _GLIBCXX_HAVE_INT64_T_LONG 1 2025-05-07T20:26:27.1667189Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:26:27.1667415Z #define __intN_t(N,MODE) typedef int int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:26:27.1667518Z #define __nlink_t_defined 2025-05-07T20:26:27.1667649Z #define _GLIBCXX17_DEPRECATED [[__deprecated__]] 2025-05-07T20:26:27.1667764Z #define _PSTL_STRING(x) _PSTL_STRING_AUX(x) 2025-05-07T20:26:27.1667864Z #define _XOPEN_LIM_H 1 2025-05-07T20:26:27.1668127Z #define __u_intN_t(N,MODE) typedef unsigned int u_int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:26:27.1668257Z #define __cpp_template_template_args 201611L 2025-05-07T20:26:27.1668367Z #define _GTHREAD_USE_MUTEX_TIMEDLOCK 1 2025-05-07T20:26:27.1668472Z #define BC_DIM_MAX _POSIX2_BC_DIM_MAX 2025-05-07T20:26:27.1668580Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:26:27.1668680Z #define __FILE_defined 1 2025-05-07T20:26:27.1668862Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:26:27.1668968Z #define _GLIBCXX_HAVE_SINCOS 1 2025-05-07T20:26:27.1669071Z #define __USE_XOPEN_EXTENDED 1 2025-05-07T20:26:27.1669185Z #define __cpp_lib_tuple_element_t 201402L 2025-05-07T20:26:27.1669314Z #define isascii_l(c,l) __isascii_l ((c), (l)) 2025-05-07T20:26:27.1669448Z #define cudaInvalidDeviceId ((int)-2) 2025-05-07T20:26:27.1669570Z #define _GLIBCXX_HAVE_SYS_RESOURCE_H 1 2025-05-07T20:26:27.1669675Z #define __INT16_C(c) c 2025-05-07T20:26:27.1669773Z #define __U32_TYPE unsigned int 2025-05-07T20:26:27.1669881Z #define _GLIBCXX_HAVE_SYS_IOCTL_H 1 2025-05-07T20:26:27.1670007Z #define FD_CLR(fd,fdsetp) __FD_CLR (fd, fdsetp) 2025-05-07T20:26:27.1670095Z #define __STDC__ 1 2025-05-07T20:26:27.1670200Z #define _GLIBCXX_HAVE_VWSCANF 1 2025-05-07T20:26:27.1670306Z #define _GLIBCXX_HAVE_EXECINFO_H 1 2025-05-07T20:26:27.1670405Z #define _GLIBCXX_USE_REALPATH 1 2025-05-07T20:26:27.1670653Z #define __attribute_malloc__ __attribute__ ((__malloc__)) 2025-05-07T20:26:27.1670747Z #define __FLT32X_DIG__ 15 2025-05-07T20:26:27.1670850Z #define _GLIBCXX_USE_C99_CTYPE_TR1 1 2025-05-07T20:26:27.1671029Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:26:27.1671147Z #define cudaArrayDeferredMapping 0x80 2025-05-07T20:26:27.1671266Z #define _GLIBCXX_END_NAMESPACE_CONTAINER 2025-05-07T20:26:27.1671368Z #define USHRT_MAX (SHRT_MAX * 2 + 1) 2025-05-07T20:26:27.1671472Z #define __cpp_lib_is_swappable 201603 2025-05-07T20:26:27.1671566Z #define stdin stdin 2025-05-07T20:26:27.1671662Z #define __ino64_t_defined 
2025-05-07T20:26:27.1671753Z #define STA_CLK 0x8000 2025-05-07T20:26:27.1671855Z #define __clockid_t_defined 1 2025-05-07T20:26:27.1672006Z #define _GLIBCXX_NOEXCEPT_IF(...) noexcept(__VA_ARGS__) 2025-05-07T20:26:27.1672173Z #define __attribute_noinline__ __attribute__ ((__noinline__)) 2025-05-07T20:26:27.1672285Z #define __cudaCDP2MemsetAsync 2025-05-07T20:26:27.1672391Z #define _PSTL_PRAGMA_SIMD_SCAN(PRM) 2025-05-07T20:26:27.1672505Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL 2025-05-07T20:26:27.1672619Z #define _GLIBCXX_TR1_POLY_HERMITE_TCC 1 2025-05-07T20:26:27.1672822Z #define __FD_SET(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] |= __FD_MASK (d))) 2025-05-07T20:26:27.1672924Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:26:27.1673451Z #define __tobody(c,f,a,args) (__extension__ ({ int __res; if (sizeof (c) > 1) { if (__builtin_constant_p (c)) { int __c = (c); __res = __c < -128 || __c > 255 ? __c : (a)[__c]; } else __res = f args; } else __res = (a)[(int) (c)]; __res; })) 2025-05-07T20:26:27.1673539Z #define DOMAIN 1 2025-05-07T20:26:27.1673642Z #define M_LN2 0.69314718055994530942 2025-05-07T20:26:27.1673730Z #define __NVCC__ 1 2025-05-07T20:26:27.1673838Z #define __cudaCDP2Memset2DAsync 2025-05-07T20:26:27.1673962Z #define __CLOCK_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:27.1674070Z #define _PSTL_PRAGMA_SIMD_EARLYEXIT 2025-05-07T20:26:27.1674182Z #define __throw_exception_again throw 2025-05-07T20:26:27.1674284Z #define M_SQRT2 1.41421356237309504880 2025-05-07T20:26:27.1674378Z #define __EXCEPTION_H 1 2025-05-07T20:26:27.1674485Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:26:27.1674600Z #define HUGE_VAL (__builtin_huge_val()) 2025-05-07T20:26:27.1674907Z #define cudaStreamAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:27.1675031Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:26:27.1675135Z #define _GLIBCXX_INLINE_VERSION 0 2025-05-07T20:26:27.1675233Z #define _GLIBCXX_USE_INT128 1 2025-05-07T20:26:27.1675345Z #define __cpp_lib_bool_constant 201505 2025-05-07T20:26:27.1675447Z #define PTHREAD_KEYS_MAX 1024 2025-05-07T20:26:27.1675603Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:26:27.1675716Z #define __FSFILCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:27.1675830Z #define _GLIBCXX_DOUBLE_IS_IEEE_BINARY64 1 2025-05-07T20:26:27.1675934Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:26:27.1676043Z #define __cpp_lib_tuples_by_type 201304 2025-05-07T20:26:27.1676146Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:26:27.1676259Z #define __cpp_generic_lambdas 201304L 2025-05-07T20:26:27.1676403Z #define _GLIBCXX_THROW_OR_ABORT(_EXC) (throw (_EXC)) 2025-05-07T20:26:27.1676505Z #define __useconds_t_defined 2025-05-07T20:26:27.1676615Z #define _GLIBCXX_USE_SCHED_YIELD 1 2025-05-07T20:26:27.1676800Z #define __attribute_deprecated__ __attribute__ ((__deprecated__)) 2025-05-07T20:26:27.1676952Z #define __cpp_lib_type_trait_variable_templates 201510L 2025-05-07T20:26:27.1677049Z #define __SSE_MATH__ 1 2025-05-07T20:26:27.1677145Z #define _IO_wint_t wint_t 2025-05-07T20:26:27.1677249Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:26:27.1677343Z #define _GLIBCXX_VERBOSE 1 2025-05-07T20:26:27.1677441Z #define _GLIBCXX_HAVE_ASINF 1 2025-05-07T20:26:27.1677565Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:26:27.1677666Z #define _GLIBCXX_HAVE_ISINFL 1 2025-05-07T20:26:27.1677765Z #define _GLIBCXX_HAVE_ASINL 1 2025-05-07T20:26:27.1677948Z #define __USE_ATFILE 1 2025-05-07T20:26:27.1678047Z #define _POSIX_OPEN_MAX 
20 2025-05-07T20:26:27.1678150Z #define _POSIX_LOGIN_NAME_MAX 9 2025-05-07T20:26:27.1678247Z #define _GCC_PTRDIFF_T 2025-05-07T20:26:27.1678550Z #define cudaKernelNodeAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:27.1678662Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:26:27.1678768Z #define _POSIX_THREAD_KEYS_MAX 128 2025-05-07T20:26:27.1678875Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:26:27.1678995Z #define __cpp_lib_array_constexpr 201803L 2025-05-07T20:26:27.1679085Z #define _STDLIB_H 1 2025-05-07T20:26:27.1679226Z #define __exctype(name) extern int name (int) __THROW 2025-05-07T20:26:27.1679340Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:26:27.1679458Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:26:27.1679610Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:27.1679734Z #define __SURFACE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:26:27.1679835Z #define __SM_61_INTRINSICS_H__ 2025-05-07T20:26:27.1680027Z #define _GLIBCXX_PACKAGE_STRING "package-unused version-unused" 2025-05-07T20:26:27.1680193Z #define __isxdigit_l(c,l) __isctype_l((c), _ISxdigit, (l)) 2025-05-07T20:26:27.1680305Z #define __glibcxx_requires_nonempty() 2025-05-07T20:26:27.1680433Z #define w_stopsig __wait_stopped.__w_stopsig 2025-05-07T20:26:27.1680531Z #define __ldiv_t_defined 1 2025-05-07T20:26:27.1680715Z #define __glibcxx_requires_irreflexive_pred(_First,_Last,_Pred) 2025-05-07T20:26:27.1680820Z #define ___int_ptrdiff_t_h 2025-05-07T20:26:27.1680992Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:27.1681099Z #define __cudaCDP2EventDestroy 2025-05-07T20:26:27.1681200Z #define __HOST_DEFINES_H__ 2025-05-07T20:26:27.1681306Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:26:27.1681411Z #define __SM_20_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:27.1681521Z #define _GLIBCXX_USE_NANOSLEEP 1 2025-05-07T20:26:27.1681611Z #define CUDART_CB 2025-05-07T20:26:27.1681726Z #define BC_BASE_MAX _POSIX2_BC_BASE_MAX 2025-05-07T20:26:27.1681855Z #define _GLIBCXX_USE_C99_INTTYPES_WCHAR_T_TR1 1 2025-05-07T20:26:27.1681946Z #define MB_LEN_MAX 16 2025-05-07T20:26:27.1682181Z #define __glibcxx_requires_partitioned_lower_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:26:27.1682286Z #define _GLIBCXX11_USE_C99_WCHAR 1 2025-05-07T20:26:27.1682416Z #define _IO_peekc(_fp) _IO_peekc_unlocked (_fp) 2025-05-07T20:26:27.1682537Z #define _GLIBCXX_HAVE_AS_SYMVER_DIRECTIVE 1 2025-05-07T20:26:27.1682637Z #define _GLIBCXX_HAVE_UNISTD_H 1 2025-05-07T20:26:27.1682787Z #define __glibc_likely(cond) __builtin_expect((cond), 1) 2025-05-07T20:26:27.1682903Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:26:27.1682994Z #define _GNU_SOURCE 1 2025-05-07T20:26:27.1683091Z #define __stub_putmsg 2025-05-07T20:26:27.1683182Z #define __CUDACC__ 1 2025-05-07T20:26:27.1683274Z #define __N(msgid) (msgid) 2025-05-07T20:26:27.1683368Z #define __P(args) args 2025-05-07T20:26:27.1683623Z #define cudaKernelNodeAttributeCooperative cudaLaunchAttributeCooperative 2025-05-07T20:26:27.1683734Z #define __cpp_init_captures 201304L 2025-05-07T20:26:27.1683851Z #define _GLIBCXX17_CONSTEXPR constexpr 2025-05-07T20:26:27.1683949Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:26:27.1684051Z #define __cpp_lib_as_const 201510 2025-05-07T20:26:27.1684145Z #define __WCHAR_T 2025-05-07T20:26:27.1684240Z #define __ATOMIC_RELEASE 3 2025-05-07T20:26:27.1684338Z #define __fsblkcnt_t_defined 2025-05-07T20:26:27.1684464Z #define __cudaCDP2EventCreateWithFlags 2025-05-07T20:26:27.1684570Z #define 
__DEVICE_DOUBLE_FUNCTIONS_H__ 2025-05-07T20:26:27.1684576Z 2025-05-07T20:26:27.1854798Z 2025-05-07T20:26:27.1855530Z + conda run -n build_binary nvcc --version 2025-05-07T20:26:27.1855545Z 2025-05-07T20:26:29.0932888Z nvcc: NVIDIA (R) Cuda compiler driver 2025-05-07T20:26:29.0933254Z Copyright (c) 2005-2024 NVIDIA Corporation 2025-05-07T20:26:29.0933572Z Built on Tue_Oct_29_23:50:19_PDT_2024 2025-05-07T20:26:29.0933892Z Cuda compilation tools, release 12.6, V12.6.85 2025-05-07T20:26:29.0934578Z Build cuda_12.6.r12.6/compiler.35059454_0 2025-05-07T20:26:29.0934782Z 2025-05-07T20:26:29.1561189Z 2025-05-07T20:26:29.1565668Z /usr/bin/nvidia-smi 2025-05-07T20:26:29.1571284Z + nvidia-smi 2025-05-07T20:26:29.1571437Z 2025-05-07T20:26:29.1747034Z Wed May 7 20:26:29 2025 2025-05-07T20:26:29.1747500Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:26:29.1748185Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:26:29.1748676Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:26:29.1749162Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:26:29.1749680Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:26:29.1750107Z | | | MIG M. | 2025-05-07T20:26:29.1750465Z |=========================================+========================+======================| 2025-05-07T20:26:29.1915409Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:26:29.1915856Z | 0% 25C P8 15W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:26:29.1916239Z | | | N/A | 2025-05-07T20:26:29.1916643Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:26:29.1920215Z 2025-05-07T20:26:29.1920616Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:26:29.1921040Z | Processes: | 2025-05-07T20:26:29.1921485Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:26:29.1921898Z | ID ID Usage | 2025-05-07T20:26:29.1922247Z |=========================================================================================| 2025-05-07T20:26:29.1925184Z | No running processes found | 2025-05-07T20:26:29.1925656Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:26:29.4620571Z 2025-05-07T20:26:29.4625215Z [INSTALL] Successfully installed CUDA 12.6.3 2025-05-07T20:26:29.4673807Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3 2025-05-07T20:26:29.4674362Z . 
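The pair of checks above validates the toolchain end to end: nvcc reports the CUDA toolkit installed inside the conda environment (release 12.6, V12.6.85), while nvidia-smi reports the host driver (570.133.07) and the highest CUDA version that driver supports (12.8). The configuration is healthy whenever the driver's supported CUDA version is at least the toolkit's, as it is here. A minimal sketch of the same verification, assuming the build_binary environment from this job:

    # Toolkit version inside the env vs. driver capability on the host
    conda run -n build_binary nvcc --version
    nvidia-smi --query-gpu=name,driver_version --format=csv,noheader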
2025-05-07T20:26:29.4673807Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:26:29.4674362Z . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:26:29.4685957Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:26:29.4686315Z env:
2025-05-07T20:26:29.4686542Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:26:29.4686836Z   BUILD_ENV: build_binary
2025-05-07T20:26:29.4687081Z   BUILD_TARGET: genai
2025-05-07T20:26:29.4687311Z   BUILD_VARIANT: cuda
2025-05-07T20:26:29.4687618Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:26:29.4687879Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:26:29.4688177Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:26:29.4688507Z ##[endgroup]
2025-05-07T20:26:29.8037568Z ################################################################################
2025-05-07T20:26:29.8038049Z # Install PyTorch (PIP)
2025-05-07T20:26:29.8038345Z #
2025-05-07T20:26:29.8054054Z # [2025-05-07T20:26:29.805Z] + install_pytorch_pip build_binary nightly cuda/12.6.3
2025-05-07T20:26:29.8054654Z ################################################################################
2025-05-07T20:26:29.8082310Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy
2025-05-07T20:26:30.7878268Z Channels:
2025-05-07T20:26:30.7878519Z  - conda-forge
2025-05-07T20:26:30.7878767Z Platform: linux-64
2025-05-07T20:26:34.1872332Z Collecting package metadata (repodata.json): done
2025-05-07T20:26:34.9102969Z Solving environment: done
2025-05-07T20:26:35.1475765Z ## Package Plan ##
2025-05-07T20:26:35.1476141Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:26:35.1476566Z   added / updated specs:
2025-05-07T20:26:35.1476829Z     - numpy
2025-05-07T20:26:35.1477124Z The following packages will be downloaded:
2025-05-07T20:26:35.1477467Z     package                    |            build
2025-05-07T20:26:35.1477783Z     ---------------------------|-----------------
2025-05-07T20:26:35.1478165Z     libblas-3.9.0              |31_h59b9bed_openblas          16 KB  conda-forge
2025-05-07T20:26:35.1478624Z     libcblas-3.9.0             |31_he106b2a_openblas          16 KB  conda-forge
2025-05-07T20:26:35.1479065Z     libgfortran-15.1.0         |      h69a702a_2              34 KB  conda-forge
2025-05-07T20:26:35.1479509Z     libgfortran5-15.1.0        |      hcea5267_2             1.5 MB  conda-forge
2025-05-07T20:26:35.1479960Z     liblapack-3.9.0            |31_h7ac8fdf_openblas          16 KB  conda-forge
2025-05-07T20:26:35.1480422Z     libopenblas-0.3.29         | pthreads_h94d23a6_0         5.6 MB  conda-forge
2025-05-07T20:26:35.1480861Z     numpy-2.2.5                |  py311h5d046bc_0            8.6 MB  conda-forge
2025-05-07T20:26:35.1481246Z     ------------------------------------------------------------
2025-05-07T20:26:35.1481588Z                                            Total:        15.9 MB
2025-05-07T20:26:35.1481960Z The following NEW packages will be INSTALLED:
2025-05-07T20:26:35.1482473Z   libblas            conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas
2025-05-07T20:26:35.1482964Z   libcblas           conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas
2025-05-07T20:26:35.1483458Z   libgfortran        conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2
2025-05-07T20:26:35.1484023Z   libgfortran5       conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2
2025-05-07T20:26:35.1484538Z   liblapack          conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas
2025-05-07T20:26:35.1485070Z   libopenblas        conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0
2025-05-07T20:26:35.1485890Z   numpy              conda-forge/linux-64::numpy-2.2.5-py311h5d046bc_0
2025-05-07T20:26:35.1486311Z Downloading and Extracting Packages: ...working...
2025-05-07T20:26:35.1486672Z [conda download progress bars elided: libblas, libcblas, liblapack, libgfortran, libgfortran5, libopenblas, and numpy each reach 100%]
2025-05-07T20:26:36.3032340Z done
2025-05-07T20:26:36.4044064Z Preparing transaction: done
2025-05-07T20:26:36.6054065Z Verifying transaction: done
2025-05-07T20:26:36.7062547Z Executing transaction: done
2025-05-07T20:26:36.8838121Z ################################################################################
2025-05-07T20:26:36.8838512Z # Install Package From PyTorch PIP: torch
2025-05-07T20:26:36.8838835Z #
2025-05-07T20:26:36.8855274Z # [2025-05-07T20:26:36.885Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.6.3
2025-05-07T20:26:36.8855759Z ################################################################################
2025-05-07T20:26:36.8873122Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:26:36.9763657Z [CHECK] Network does not appear to be blocked.
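With the network confirmed reachable, the script resolves the requested (package, channel, variant) triple into a full wheel-index URL and installs through it, retrying each command up to three times (the [EXEC] [ATTEMPT n/3] prefix); the resolution is performed by __prepare_pip_arguments in setup_env.bash, as the next step shows. A minimal sketch of the equivalent manual install, assuming the build_binary environment (the cu126 variant string is derived from BUILD_CUDA_VERSION=12.6.3):

    # Nightly cu126 wheel index, matching the channel prepared below
    VARIANT=cu126
    INDEX_URL="https://download.pytorch.org/whl/nightly/${VARIANT}/"
    # --pre is needed because nightly wheels carry pre-release version tags
    conda run -n build_binary pip install --pre torch --index-url "${INDEX_URL}"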
2025-05-07T20:26:36.9764033Z ################################################################################
2025-05-07T20:26:36.9764363Z # Prepare PIP Arguments (PyTorch PIP)
2025-05-07T20:26:36.9764652Z #
2025-05-07T20:26:36.9782355Z # [2025-05-07T20:26:36.977Z] + __prepare_pip_arguments torch nightly cuda/12.6.3
2025-05-07T20:26:36.9782789Z ################################################################################
2025-05-07T20:26:36.9806732Z [INSTALL] Extracted package (channel, version): (nightly, LATEST)
2025-05-07T20:26:36.9828146Z [INSTALL] Extracted package variant: cu126
2025-05-07T20:26:36.9844418Z [INSTALL] Using a non-RELEASE channel: nightly ...
2025-05-07T20:26:36.9844970Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu126/
2025-05-07T20:26:36.9852015Z [INSTALL] Extracted the full PIP package: --pre torch
2025-05-07T20:26:36.9859413Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu126/ ...
2025-05-07T20:26:36.9880582Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126/
2025-05-07T20:28:00.8791455Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu126/
2025-05-07T20:28:00.8791885Z Collecting torch
2025-05-07T20:28:00.8792545Z   Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (30 kB)
2025-05-07T20:28:00.8793252Z Collecting filelock (from torch)
2025-05-07T20:28:00.8794047Z   Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB)
2025-05-07T20:28:00.8794979Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from torch) (4.13.2)
2025-05-07T20:28:00.8795681Z Collecting sympy>=1.13.3 (from torch)
2025-05-07T20:28:00.8796184Z   Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB)
2025-05-07T20:28:00.8797352Z Collecting networkx (from torch)
2025-05-07T20:28:00.8797863Z   Downloading https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB)
2025-05-07T20:28:00.8798855Z Collecting jinja2 (from torch)
2025-05-07T20:28:00.8799328Z   Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB)
2025-05-07T20:28:00.8799842Z Collecting fsspec (from torch)
2025-05-07T20:28:00.8800331Z   Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB)
2025-05-07T20:28:00.8800895Z Collecting nvidia-cuda-nvrtc-cu12==12.6.77 (from torch)
2025-05-07T20:28:00.8801619Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (23.7 MB)
2025-05-07T20:28:00.8802857Z Collecting nvidia-cuda-runtime-cu12==12.6.77 (from torch)
2025-05-07T20:28:00.8803574Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (897 kB)
2025-05-07T20:28:00.8804765Z Collecting nvidia-cuda-cupti-cu12==12.6.80 (from torch)
2025-05-07T20:28:00.8805451Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.whl (8.9 MB)
2025-05-07T20:28:00.8807227Z Collecting nvidia-cudnn-cu12==9.5.1.17 (from torch)
2025-05-07T20:28:00.8808091Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl (571.0 MB)
2025-05-07T20:28:00.8809494Z Collecting nvidia-cublas-cu12==12.6.4.1 (from torch)
2025-05-07T20:28:00.8810274Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (393.1 MB)
2025-05-07T20:28:00.8811535Z Collecting nvidia-cufft-cu12==11.3.0.4 (from torch)
2025-05-07T20:28:00.8812214Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.whl (200.2 MB)
2025-05-07T20:28:00.8813369Z Collecting nvidia-curand-cu12==10.3.7.77 (from torch)
2025-05-07T20:28:00.8814055Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.whl (56.3 MB)
2025-05-07T20:28:00.8815227Z Collecting nvidia-cusolver-cu12==11.7.1.2 (from torch)
2025-05-07T20:28:00.8815931Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.whl (158.2 MB)
2025-05-07T20:28:00.8817245Z Collecting nvidia-cusparse-cu12==12.5.4.2 (from torch)
2025-05-07T20:28:00.8817940Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.whl (216.6 MB)
2025-05-07T20:28:00.8819120Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch)
2025-05-07T20:28:00.8819825Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB)
2025-05-07T20:28:00.8820987Z Collecting nvidia-nccl-cu12==2.26.2 (from torch)
2025-05-07T20:28:00.8821755Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB)
2025-05-07T20:28:00.8822537Z Collecting nvidia-nvtx-cu12==12.6.77 (from torch)
2025-05-07T20:28:00.8823184Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (89 kB)
2025-05-07T20:28:00.8823855Z Collecting nvidia-nvjitlink-cu12==12.6.85 (from torch)
2025-05-07T20:28:00.8824633Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (19.7 MB)
2025-05-07T20:28:00.8825883Z Collecting nvidia-cufile-cu12==1.11.1.6 (from torch)
2025-05-07T20:28:00.8826667Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB)
2025-05-07T20:28:00.8827475Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch)
2025-05-07T20:28:00.8828309Z   Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB)
2025-05-07T20:28:00.8829575Z Requirement already satisfied: setuptools>=40.8.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from pytorch-triton==3.3.0+git96316ce5->torch) (78.1.1)
2025-05-07T20:28:00.8830443Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch)
2025-05-07T20:28:00.8831003Z   Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB)
2025-05-07T20:28:00.8832100Z Collecting MarkupSafe>=2.0 (from jinja2->torch)
2025-05-07T20:28:00.8832797Z   Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (28 kB)
2025-05-07T20:28:00.8833849Z   Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp311-cp311-manylinux_2_28_x86_64.whl (825.6 MB)
2025-05-07T20:28:00.8835505Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.1 MB)
2025-05-07T20:28:00.8837172Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB)
2025-05-07T20:28:00.8838815Z   Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.5 MB)
2025-05-07T20:28:00.8841433Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch
nvidia-curand-cu12-10.3.7.77 nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu126 2025-05-07T20:28:00.8847080Z 2025-05-07T20:28:03.0794179Z torch 2.8.0.dev20250507+cu126 2025-05-07T20:28:03.0796471Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu126) 2025-05-07T20:28:06.4688891Z [CHECK] Python (sub-)package 'torch.distributed' found ... 2025-05-07T20:28:09.8884711Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu126 2025-05-07T20:28:09.8885153Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ... 2025-05-07T20:28:13.2228983Z True 2025-05-07T20:28:13.2229287Z True 2025-05-07T20:28:13.2229820Z 2025-05-07T20:28:13.2855279Z [INSTALL] Successfully installed PyTorch through PyTorch PIP 2025-05-07T20:28:13.2893832Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:13.2894439Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:13.2909319Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:13.2909708Z env: 2025-05-07T20:28:13.2909945Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:13.2910275Z BUILD_ENV: build_binary 2025-05-07T20:28:13.2910521Z BUILD_TARGET: genai 2025-05-07T20:28:13.2910746Z BUILD_VARIANT: cuda 2025-05-07T20:28:13.2910983Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:13.2911230Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:13.2911533Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:13.2911865Z ##[endgroup] 2025-05-07T20:28:13.6254919Z /home/ec2-user/miniconda/bin/conda 2025-05-07T20:28:13.6257772Z ################################################################################ 2025-05-07T20:28:13.6258904Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T20:28:13.6259635Z # 2025-05-07T20:28:13.6272955Z # [2025-05-07T20:28:13.626Z] + collect_pytorch_env_info build_binary 2025-05-07T20:28:13.6273698Z ################################################################################ 2025-05-07T20:28:13.6273917Z 2025-05-07T20:28:13.6288410Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:13.7233552Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:13.7243261Z [INFO] Downloading the PyTorch environment info collection script ... 2025-05-07T20:28:13.7243889Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T20:28:13.7244289Z 2025-05-07T20:28:13.8227856Z 2025-05-07T20:28:13.8228517Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T20:28:13.8251719Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py 2025-05-07T20:28:19.9267834Z Collecting environment information... 
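Those variant and ABI checks can be reproduced by hand in the same env; the environment report that follows confirms the result. A minimal sketch, assuming only that the nightly torch wheel installed above is importable:

    import torch

    # Version string encodes channel and variant, e.g. "2.8.0.dev20250507+cu126"
    print(torch.__version__)
    # CUDA toolkit the wheel was built against, e.g. "12.6"
    print(torch.version.cuda)
    # Two common probes for the C++11 ABI setting; the job prints "True True"
    print(torch._C._GLIBCXX_USE_CXX11_ABI)
    print(torch.compiled_with_cxx11_abi())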
2025-05-07T20:28:19.9268386Z PyTorch version: 2.8.0.dev20250507+cu126 2025-05-07T20:28:19.9268703Z Is debug build: False 2025-05-07T20:28:19.9268955Z CUDA used to build PyTorch: 12.6 2025-05-07T20:28:19.9269250Z ROCM used to build PyTorch: N/A 2025-05-07T20:28:19.9269432Z 2025-05-07T20:28:19.9269535Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:28:19.9269854Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:28:19.9270166Z Clang version: Could not collect 2025-05-07T20:28:19.9270440Z CMake version: Could not collect 2025-05-07T20:28:19.9270708Z Libc version: glibc-2.34 2025-05-07T20:28:19.9270858Z 2025-05-07T20:28:19.9271158Z Python version: 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:53:32) [GCC 12.3.0] (64-bit runtime) 2025-05-07T20:28:19.9271766Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:28:19.9272169Z Is CUDA available: True 2025-05-07T20:28:19.9272426Z CUDA runtime version: 12.6.85 2025-05-07T20:28:19.9272806Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:28:19.9273233Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:28:19.9273590Z Nvidia driver version: 570.133.07 2025-05-07T20:28:19.9273958Z cuDNN version: Could not collect 2025-05-07T20:28:19.9274341Z HIP runtime version: N/A 2025-05-07T20:28:19.9274684Z MIOpen runtime version: N/A 2025-05-07T20:28:19.9275025Z Is XNNPACK available: True 2025-05-07T20:28:19.9275236Z 2025-05-07T20:28:19.9275318Z CPU: 2025-05-07T20:28:19.9275540Z Architecture: x86_64 2025-05-07T20:28:19.9275868Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:28:19.9276258Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:28:19.9276643Z Byte Order: Little Endian 2025-05-07T20:28:19.9276954Z CPU(s): 16 2025-05-07T20:28:19.9277253Z On-line CPU(s) list: 0-15 2025-05-07T20:28:19.9277987Z Vendor ID: AuthenticAMD 2025-05-07T20:28:19.9278341Z Model name: AMD EPYC 7R32 2025-05-07T20:28:19.9278654Z CPU family: 23 2025-05-07T20:28:19.9278948Z Model: 49 2025-05-07T20:28:19.9279239Z Thread(s) per core: 2 2025-05-07T20:28:19.9279524Z Core(s) per socket: 8 2025-05-07T20:28:19.9279816Z Socket(s): 1 2025-05-07T20:28:19.9280098Z Stepping: 0 2025-05-07T20:28:19.9280393Z BogoMIPS: 5600.00 2025-05-07T20:28:19.9282472Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:28:19.9284716Z Hypervisor vendor: KVM 2025-05-07T20:28:19.9285029Z Virtualization type: full 2025-05-07T20:28:19.9285370Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:28:19.9285734Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:28:19.9286091Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:28:19.9286449Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:28:19.9286767Z NUMA node(s): 1 2025-05-07T20:28:19.9287068Z NUMA node0 CPU(s): 0-15 2025-05-07T20:28:19.9287407Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:28:19.9287903Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:28:19.9288262Z Vulnerability L1tf: Not affected 2025-05-07T20:28:19.9288623Z Vulnerability 
Mds: Not affected 2025-05-07T20:28:19.9288985Z Vulnerability Meltdown: Not affected 2025-05-07T20:28:19.9289346Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:28:19.9289717Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:28:19.9290271Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:28:19.9290854Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:28:19.9291402Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:28:19.9292094Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:28:19.9292957Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:28:19.9293633Z Vulnerability Srbds: Not affected 2025-05-07T20:28:19.9293995Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:28:19.9294237Z 2025-05-07T20:28:19.9294342Z Versions of relevant libraries: 2025-05-07T20:28:19.9294609Z [pip3] numpy==2.2.5 2025-05-07T20:28:19.9294845Z [pip3] nvidia-cublas-cu12==12.6.4.1 2025-05-07T20:28:19.9295147Z [pip3] nvidia-cuda-cupti-cu12==12.6.80 2025-05-07T20:28:19.9295453Z [pip3] nvidia-cuda-nvrtc-cu12==12.6.77 2025-05-07T20:28:19.9295763Z [pip3] nvidia-cuda-runtime-cu12==12.6.77 2025-05-07T20:28:19.9296082Z [pip3] nvidia-cudnn-cu12==9.5.1.17 2025-05-07T20:28:19.9296375Z [pip3] nvidia-cufft-cu12==11.3.0.4 2025-05-07T20:28:19.9296659Z [pip3] nvidia-curand-cu12==10.3.7.77 2025-05-07T20:28:19.9296965Z [pip3] nvidia-cusolver-cu12==11.7.1.2 2025-05-07T20:28:19.9297273Z [pip3] nvidia-cusparse-cu12==12.5.4.2 2025-05-07T20:28:19.9297695Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:28:19.9298002Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:28:19.9298292Z [pip3] nvidia-nvjitlink-cu12==12.6.85 2025-05-07T20:28:19.9298595Z [pip3] nvidia-nvtx-cu12==12.6.77 2025-05-07T20:28:19.9298894Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:28:19.9299205Z [pip3] torch==2.8.0.dev20250507+cu126 2025-05-07T20:28:19.9299577Z [conda] cuda-cudart 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:19.9300058Z [conda] cuda-cudart-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:19.9300574Z [conda] cuda-cudart-dev_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:19.9301095Z [conda] cuda-cudart-static 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:19.9301626Z [conda] cuda-cudart-static_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:19.9302160Z [conda] cuda-cudart_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:19.9302643Z [conda] cuda-cupti 12.6.80 hbd13f7d_0 conda-forge 2025-05-07T20:28:19.9303113Z [conda] cuda-cupti-dev 12.6.80 h5888daf_0 conda-forge 2025-05-07T20:28:19.9303674Z [conda] cuda-libraries 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:19.9304170Z [conda] cuda-libraries-dev 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:19.9304655Z [conda] cuda-nvrtc 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:19.9305119Z [conda] cuda-nvrtc-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:19.9305573Z [conda] cuda-nvtx 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:19.9306396Z [conda] cuda-opencl 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:19.9306954Z [conda] cuda-opencl-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:19.9307512Z [conda] cuda-runtime 12.6.3 ha804496_0 conda-forge 2025-05-07T20:28:19.9308049Z [conda] libcublas 12.6.4.1 h5888daf_1 conda-forge 
2025-05-07T20:28:19.9308595Z [conda] libcublas-dev 12.6.4.1 h5888daf_1 conda-forge 2025-05-07T20:28:19.9309140Z [conda] libcufft 11.3.0.4 hbd13f7d_0 conda-forge 2025-05-07T20:28:19.9309666Z [conda] libcufft-dev 11.3.0.4 h5888daf_0 conda-forge 2025-05-07T20:28:19.9310206Z [conda] libcurand 10.3.7.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:19.9310750Z [conda] libcurand-dev 10.3.7.77 h5888daf_0 conda-forge 2025-05-07T20:28:19.9311296Z [conda] libcusolver 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:19.9311860Z [conda] libcusolver-dev 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:19.9312426Z [conda] libcusparse 12.5.4.2 hbd13f7d_0 conda-forge 2025-05-07T20:28:19.9312991Z [conda] libcusparse-dev 12.5.4.2 h5888daf_0 conda-forge 2025-05-07T20:28:19.9313552Z [conda] libnvjitlink 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:19.9314127Z [conda] libnvjitlink-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:19.9314664Z [conda] numpy 2.2.5 py311h5d046bc_0 conda-forge 2025-05-07T20:28:19.9315196Z [conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi 2025-05-07T20:28:19.9315770Z [conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi 2025-05-07T20:28:19.9316354Z [conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:19.9316940Z [conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:19.9317505Z [conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi 2025-05-07T20:28:19.9318196Z [conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi 2025-05-07T20:28:19.9318680Z [conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi 2025-05-07T20:28:19.9319166Z [conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi 2025-05-07T20:28:19.9319656Z [conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi 2025-05-07T20:28:19.9320155Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:28:19.9320640Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:28:19.9321116Z [conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi 2025-05-07T20:28:19.9321600Z [conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:19.9322080Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:28:19.9322602Z [conda] torch 2.8.0.dev20250507+cu126 pypi_0 pypi 2025-05-07T20:28:19.9322879Z 2025-05-07T20:28:20.0005028Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:20.0005907Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:20.0019386Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:20.0019732Z env: 2025-05-07T20:28:20.0019964Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:20.0020259Z BUILD_ENV: build_binary 2025-05-07T20:28:20.0020510Z BUILD_TARGET: genai 2025-05-07T20:28:20.0020747Z BUILD_VARIANT: cuda 2025-05-07T20:28:20.0020987Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:20.0021241Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:20.0021547Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:20.0021880Z ##[endgroup] 2025-05-07T20:28:20.3370501Z ################################################################################ 2025-05-07T20:28:20.3370885Z # Prepare FBGEMM-GPU Build 2025-05-07T20:28:20.3371139Z # 2025-05-07T20:28:20.3385767Z # [2025-05-07T20:28:20.338Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:28:20.3386174Z ################################################################################ 2025-05-07T20:28:20.3386389Z 2025-05-07T20:28:20.3400834Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:20.4300174Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:20.4323747Z [BUILD] Running git submodules update ... 2025-05-07T20:28:20.4346325Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:28:20.4707736Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:28:20.4708209Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:28:20.4708649Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:28:20.4709044Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:28:20.4709436Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:28:20.4709883Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:28:20.4710279Z Synchronizing submodule url for '../external/json' 2025-05-07T20:28:20.4742683Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:28:20.5285802Z [BUILD] Installing other build dependencies ... 
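The dependency-install step that follows is a plain pip install run inside the conda env. A minimal sketch of the equivalent call, assuming the build_binary env from the earlier steps and the requirements.txt in fbgemm_gpu/:

    import subprocess

    # Mirrors the [EXEC] command below; `conda run` executes pip inside the env
    subprocess.run(
        ["conda", "run", "--no-capture-output", "-n", "build_binary",
         "python", "-m", "pip", "install", "-r", "requirements.txt"],
        check=True,  # raise on failure, as the job's retry wrapper would detect
    )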
2025-05-07T20:28:20.5308122Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:28:22.9021200Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:28:22.9207932Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:28:23.0155817Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:28:23.0185534Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:28:23.2334460Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:28:23.2366225Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:28:23.3404648Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:28:23.3437623Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:28:23.6516411Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:28:23.6564176Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:28:23.7080579Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:28:23.7083443Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:28:23.7731833Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:28:23.7761196Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:28:23.8188496Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:28:23.8699174Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:28:23.8729838Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:28:23.9953405Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:28:23.9987903Z Downloading PyYAML-6.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:28:24.0948522Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:28:24.0996041Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:28:24.1441914Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:28:24.2066899Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:28:24.2098069Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:28:24.3014921Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:28:24.3042361Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:28:24.4095731Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:28:24.4143902Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:28:24.5171993Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:24.5266078Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:28:24.6169513Z Collecting pyproject_hooks (from build->-r requirements.txt 
(line 14)) 2025-05-07T20:28:24.6205181Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:28:24.7209588Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:24.7239416Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:24.8309627Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:24.8339061Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:24.8832451Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:28:24.9374538Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:24.9418132Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:28:24.9909318Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:28:25.0425686Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:28:25.0453468Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:28:25.0955598Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:28:25.1617553Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:25.1662165Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:28:25.2138161Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:28:25.2629791Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:28:25.3157006Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:28:25.8172488Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 55.6 MB/s eta 0:00:00 2025-05-07T20:28:25.8208888Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:28:25.8771913Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:28:25.9486893Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:28:26.0098782Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:28:26.0698475Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:28:26.1248979Z Downloading PyYAML-6.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (762 kB) 2025-05-07T20:28:26.1860611Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 763.0/763.0 kB 8.5 MB/s eta 0:00:00 2025-05-07T20:28:26.1897953Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:28:26.2371664Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:26.2905812Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:28:26.3444600Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:28:26.4127553Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:28:26.4646964Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:28:26.5295039Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:28:26.5837561Z Downloading 
pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:26.6444549Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:28:26.7051328Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:28:26.8825122Z Installing collected packages: sortedcontainers, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:28:29.2653915Z 2025-05-07T20:28:29.2680533Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 typing-inspect-0.9.0 2025-05-07T20:28:29.4480534Z ################################################################################ 2025-05-07T20:28:29.4480894Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:28:29.4481160Z # 2025-05-07T20:28:29.4499509Z # [2025-05-07T20:28:29.449Z] + install_triton_pip build_binary 2025-05-07T20:28:29.4499890Z ################################################################################ 2025-05-07T20:28:29.4500103Z 2025-05-07T20:28:29.4500332Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:28:29.4500767Z ################################################################################ 2025-05-07T20:28:29.4501117Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:28:29.4501433Z # 2025-05-07T20:28:29.4516936Z # [2025-05-07T20:28:29.451Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:29.4517448Z ################################################################################ 2025-05-07T20:28:29.4517667Z 2025-05-07T20:28:29.4532687Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:29.5407559Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:29.5407947Z ################################################################################ 2025-05-07T20:28:29.5408371Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:28:29.5408713Z # 2025-05-07T20:28:29.5425280Z # [2025-05-07T20:28:29.542Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:29.5425901Z ################################################################################ 2025-05-07T20:28:29.5426121Z 2025-05-07T20:28:29.5474824Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:28:29.5491462Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:28:29.5491973Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:29.5500295Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:29.5510033Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:28:29.5530902Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:37.6100365Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts. 2025-05-07T20:28:37.6101893Z torch 2.8.0.dev20250507+cu126 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux" and platform_machine == "x86_64", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:28:37.6102621Z 2025-05-07T20:28:37.6102827Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:37.6103239Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:37.6104043Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:28:37.6105241Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:28:37.6106657Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 49.3 MB/s eta 0:00:00 2025-05-07T20:28:37.6107129Z Installing collected packages: pytorch-triton 2025-05-07T20:28:37.6107585Z Attempting uninstall: pytorch-triton 2025-05-07T20:28:37.6108203Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:28:37.6108707Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:28:37.6117997Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:28:37.6118493Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:28:37.6118762Z 2025-05-07T20:28:39.8340886Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:28:39.8344286Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:28:41.9702307Z ################################################################################ 2025-05-07T20:28:41.9703130Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:28:41.9703820Z ################################################################################ 2025-05-07T20:28:41.9704211Z 2025-05-07T20:28:44.0023884Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:28:46.1049307Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:28:46.1052955Z [BUILD] Successfully ran git submodules update 2025-05-07T20:28:46.1086597Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:46.1087088Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:46.1102995Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:46.1103341Z env: 2025-05-07T20:28:46.1103578Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:46.1103868Z BUILD_ENV: build_binary 2025-05-07T20:28:46.1104114Z BUILD_TARGET: genai 2025-05-07T20:28:46.1104338Z BUILD_VARIANT: cuda 2025-05-07T20:28:46.1104566Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:46.1104820Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:46.1105123Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:46.1105453Z ##[endgroup] 2025-05-07T20:28:46.4460320Z ################################################################################ 2025-05-07T20:28:46.4460691Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:28:46.4460958Z # 2025-05-07T20:28:46.4475483Z # [2025-05-07T20:28:46.447Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:46.4476119Z ################################################################################ 2025-05-07T20:28:46.4476338Z 2025-05-07T20:28:46.4476695Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:46.4477694Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:46.4478027Z 2025-05-07T20:28:46.4593859Z 7d736fee50ce6716a3e7a5042537bb0127686eb5 fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:46.4596821Z 2025-05-07T20:28:46.4598086Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:46.4598437Z 2025-05-07T20:28:46.4732845Z 0a23a86eb2b2d57e022570aa036f879ef469d611e4dea9b68c6aaab9d3746d15 fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:46.4735885Z 2025-05-07T20:28:46.4736599Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:46.4736954Z 2025-05-07T20:28:46.4967100Z d1d3b7cfd6b55cbd576df724965b7f7c fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:46.4969233Z 2025-05-07T20:28:46.4978997Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl ... 2025-05-07T20:28:46.5000195Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:49.1473264Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:49.1475144Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:28:49.1476807Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:28:49.1477655Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:28:49.1478190Z 2025-05-07T20:28:55.9671041Z ################################################################################ 2025-05-07T20:28:55.9671420Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:28:55.9671817Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu126 2025-05-07T20:28:55.9672248Z [CHECK] CUDA version reported by PyTorch is: 12.6 2025-05-07T20:28:55.9672568Z [CHECK] 2025-05-07T20:28:55.9672891Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU 2025-05-07T20:28:55.9673384Z [CHECK] package channel; the package may be broken at runtime!!! 2025-05-07T20:28:55.9673775Z ################################################################################ 2025-05-07T20:28:55.9673985Z 2025-05-07T20:28:55.9674107Z [INSTALL] Checking imports and symbols ... 2025-05-07T20:28:59.8784743Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:29:03.7660029Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:07.6438901Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:07.6445314Z [CHECK] Printing out the FBGEMM-GPU version ... 2025-05-07T20:29:19.3388789Z ################################################################################ 2025-05-07T20:29:19.3391244Z [CHECK] The installed FBGEMM TARGET is: genai 2025-05-07T20:29:19.3391672Z [CHECK] The installed FBGEMM VARIANT is: cuda 2025-05-07T20:29:19.3392085Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7 2025-05-07T20:29:19.3392427Z ################################################################################ 2025-05-07T20:29:19.3392657Z 2025-05-07T20:29:27.1544678Z ################################################################################ 2025-05-07T20:29:27.1545201Z [CHECK] FBGEMM_GPU Experimental Packages 2025-05-07T20:29:27.1547123Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils'] 2025-05-07T20:29:27.1550028Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 2025-05-07T20:29:27.1550545Z ################################################################################ 2025-05-07T20:29:27.1550773Z 2025-05-07T20:29:27.1550927Z [INSTALL] Check for installation of Python sources ... 2025-05-07T20:29:31.0603531Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ... 2025-05-07T20:29:34.9460854Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ... 2025-05-07T20:29:38.9749945Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ... 2025-05-07T20:29:42.8631773Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ... 2025-05-07T20:29:42.8637164Z [INSTALL] Check for operator registrations ... 
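The operator-registration check announced above amounts to importing fbgemm_gpu (which loads the native libraries as a side effect) and resolving each operator on torch.ops. A minimal sketch, using the three operator names printed below:

    import torch
    import fbgemm_gpu  # noqa: F401 - the import loads the FBGEMM operator libraries

    for name in ("nccl_init", "gqa_attn_splitk", "rope_qkv_decoding"):
        # Attribute lookup raises if the operator was never registered
        op = getattr(torch.ops.fbgemm, name)
        print(f"torch.ops.fbgemm.{name} -> {op}")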
2025-05-07T20:29:46.6968109Z fbgemm.nccl_init 2025-05-07T20:29:46.6968360Z 2025-05-07T20:29:46.7598601Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:29:50.5842528Z fbgemm.gqa_attn_splitk 2025-05-07T20:29:50.5842790Z 2025-05-07T20:29:50.6484579Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:29:54.4724513Z fbgemm.rope_qkv_decoding 2025-05-07T20:29:54.4724722Z 2025-05-07T20:29:54.5347646Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:29:54.5348466Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:29:54.5383291Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:29:54.5383766Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:29:54.5396556Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:54.5396901Z env: 2025-05-07T20:29:54.5397131Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:54.5397428Z BUILD_ENV: build_binary 2025-05-07T20:29:54.5397683Z BUILD_TARGET: genai 2025-05-07T20:29:54.5397930Z BUILD_VARIANT: cuda 2025-05-07T20:29:54.5398167Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:29:54.5398419Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:54.5398724Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:54.5399063Z ##[endgroup] 2025-05-07T20:29:54.8749094Z ################################################################################ 2025-05-07T20:29:54.8749573Z # Test All FBGEMM-GPU Modules 2025-05-07T20:29:54.8749830Z # 2025-05-07T20:29:54.8764275Z # [2025-05-07T20:29:54.876Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:29:54.8764820Z ################################################################################ 2025-05-07T20:29:54.8765098Z 2025-05-07T20:30:02.6885464Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:30:02.6886034Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:30:02.6886429Z [TEST] Determined the test directories: 2025-05-07T20:30:02.6886769Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:30:02.6887069Z fbgemm_gpu/experimental/example/test 2025-05-07T20:30:02.6887366Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:30:02.6887667Z 2025-05-07T20:30:02.6897996Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:30:02.6905100Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:30:02.6905761Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:30:02.6906174Z 2025-05-07T20:30:03.1138587Z 2025-05-07T20:30:03.1138999Z [TEST] Installing PyTest ... 
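Because ENFORCE_CUDA_DEVICE=1 is set for this job, the GPU must be usable before any suite runs. A quick sanity check one could run in the same env (a sketch, not the job's own script):

    import torch

    # Fail fast if the driver or the A10G device is not visible to torch
    assert torch.cuda.is_available(), "no CUDA device visible"
    print(torch.cuda.device_count(), torch.cuda.get_device_name(0))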
2025-05-07T20:30:03.1161436Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest 2025-05-07T20:30:04.2286530Z Channels: 2025-05-07T20:30:04.2286768Z - conda-forge 2025-05-07T20:30:04.2287005Z Platform: linux-64 2025-05-07T20:30:07.5537151Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:30:08.6978878Z Solving environment: \ | / done 2025-05-07T20:30:08.9444216Z 2025-05-07T20:30:08.9444487Z ## Package Plan ## 2025-05-07T20:30:08.9444658Z 2025-05-07T20:30:08.9444867Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:30:08.9445165Z 2025-05-07T20:30:08.9445272Z added / updated specs: 2025-05-07T20:30:08.9445525Z - expecttest 2025-05-07T20:30:08.9445755Z - pytest 2025-05-07T20:30:08.9445879Z 2025-05-07T20:30:08.9445893Z 2025-05-07T20:30:08.9446015Z The following packages will be downloaded: 2025-05-07T20:30:08.9446254Z 2025-05-07T20:30:08.9446379Z package | build 2025-05-07T20:30:08.9446692Z ---------------------------|----------------- 2025-05-07T20:30:08.9447062Z colorama-0.4.6 | pyhd8ed1ab_1 26 KB conda-forge 2025-05-07T20:30:08.9447586Z exceptiongroup-1.2.2 | pyhd8ed1ab_1 20 KB conda-forge 2025-05-07T20:30:08.9448046Z expecttest-0.3.0 | pyhd8ed1ab_0 14 KB conda-forge 2025-05-07T20:30:08.9448499Z iniconfig-2.0.0 | pyhd8ed1ab_1 11 KB conda-forge 2025-05-07T20:30:08.9448940Z packaging-25.0 | pyh29332c3_1 61 KB conda-forge 2025-05-07T20:30:08.9449359Z pluggy-1.5.0 | pyhd8ed1ab_1 23 KB conda-forge 2025-05-07T20:30:08.9449763Z pytest-8.3.5 | pyhd8ed1ab_0 254 KB conda-forge 2025-05-07T20:30:08.9450508Z tomli-2.2.1 | pyhd8ed1ab_1 19 KB conda-forge 2025-05-07T20:30:08.9450899Z ------------------------------------------------------------ 2025-05-07T20:30:08.9451236Z Total: 428 KB 2025-05-07T20:30:08.9451441Z 2025-05-07T20:30:08.9451566Z The following NEW packages will be INSTALLED: 2025-05-07T20:30:08.9451788Z 2025-05-07T20:30:08.9451990Z colorama conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1 2025-05-07T20:30:08.9452487Z exceptiongroup conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1 2025-05-07T20:30:08.9453009Z expecttest conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0 2025-05-07T20:30:08.9453477Z iniconfig conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1 2025-05-07T20:30:08.9453937Z packaging conda-forge/noarch::packaging-25.0-pyh29332c3_1 2025-05-07T20:30:08.9454381Z pluggy conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1 2025-05-07T20:30:08.9454808Z pytest conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0 2025-05-07T20:30:08.9455216Z tomli conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1 2025-05-07T20:30:08.9455476Z 2025-05-07T20:30:08.9455479Z 2025-05-07T20:30:08.9455483Z 2025-05-07T20:30:08.9455622Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:30:08.9455985Z [conda download progress bars elided: pytest-8.3.5 (254 KB), packaging-25.0 (61 KB), colorama-0.4.6 (26 KB), pluggy-1.5.0 (23 KB), exceptiongroup-1.2.2 (20 KB), tomli-2.2.1 (19 KB), expecttest-0.3.0 (14 KB), and iniconfig-2.0.0 (11 KB) each reached 100%] done 2025-05-07T20:30:09.5491191Z Preparing transaction: done 2025-05-07T20:30:09.6496008Z Verifying transaction: done 2025-05-07T20:30:11.5522083Z Executing transaction: done 2025-05-07T20:30:11.6837607Z [TEST] Checking imports ... 2025-05-07T20:30:15.5818114Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:30:15.5830851Z [TEST] Setting feature flags ...
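The feature flag set in the next step is an ordinary environment variable scoped to the conda env, which FBGEMM_GPU code can then read at runtime. A sketch of how code under test might observe it (the flag name is taken from the command below):

    import os

    # `conda env config vars set` exports this into every `conda run` shell
    enabled = os.environ.get("FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD") == "1"
    print("ensemble rowwise adagrad enabled:", enabled)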
2025-05-07T20:30:15.5831451Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:30:15.5831913Z 2025-05-07T20:30:16.0039441Z 2025-05-07T20:30:16.0039986Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:30:16.0041187Z ################################################################################ 2025-05-07T20:30:16.0041662Z # Run FBGEMM-GPU Tests: 2025-05-07T20:30:16.0041994Z # 2025-05-07T20:30:16.0061343Z # [2025-05-07T20:30:16.005Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:30:16.0061922Z ################################################################################ 2025-05-07T20:30:16.0062210Z 2025-05-07T20:30:16.0068982Z [TEST] Enumerating ALL test files ... 2025-05-07T20:30:16.0097409Z ./attention/gqa_test.py 2025-05-07T20:30:16.0097776Z ./coalesce/coalesce_test.py 2025-05-07T20:30:16.0098430Z ./comm/multi_gpu_car_test.py 2025-05-07T20:30:16.0098809Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:30:16.0099173Z ./kv_cache/kv_cache_test.py 2025-05-07T20:30:16.0099432Z ./moe/activation_test.py 2025-05-07T20:30:16.0099685Z ./moe/gather_scatter_test.py 2025-05-07T20:30:16.0099932Z ./moe/layers_test.py 2025-05-07T20:30:16.0100162Z ./moe/shuffling_test.py 2025-05-07T20:30:16.0100404Z ./quantize/quantize_test.py 2025-05-07T20:30:16.0100566Z 2025-05-07T20:30:16.0100678Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:30:16.0100896Z 2025-05-07T20:30:16.0118080Z ################################################################################ 2025-05-07T20:30:16.0133348Z # [2025-05-07T20:30:16.013Z] Run Python Test Suite: 2025-05-07T20:30:16.0133808Z # ./attention/gqa_test.py 2025-05-07T20:30:16.0134169Z ################################################################################ 2025-05-07T20:30:16.0157842Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:30:16.0158602Z 2025-05-07T20:30:18.5123110Z ============================= test session starts ============================== 2025-05-07T20:30:18.5124167Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:18.5125062Z cachedir: .pytest_cache 2025-05-07T20:30:18.5126346Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:18.5127752Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:18.5128433Z plugins: hypothesis-6.131.14 2025-05-07T20:30:20.2352422Z collecting ... 
collected 2 items

attention/gqa_test.py::Int4GQATest::test_gqa
Trying example: test_gqa(int4_kv=False, num_groups=1, B=1, MAX_T=4, N_H_L=1)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=1, MAX_T=4, N_H_L=1)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=23, MAX_T=33, N_H_L=68)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=77, MAX_T=4, N_H_L=1)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=77, MAX_T=52, N_H_L=67)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=57, MAX_T=45, N_H_L=120)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=52, MAX_T=42, N_H_L=53)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=77, MAX_T=95, N_H_L=53)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=113, MAX_T=48, N_H_L=96)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=51, MAX_T=61, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=17, MAX_T=113, N_H_L=65)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=17, MAX_T=65, N_H_L=65)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=65, MAX_T=65, N_H_L=65)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=108, N_H_L=14)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=14, N_H_L=14)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=6, N_H_L=14)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=6, N_H_L=6)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=70, MAX_T=94, N_H_L=78)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=78, MAX_T=94, N_H_L=78)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=94, MAX_T=94, N_H_L=78)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=94, MAX_T=94, N_H_L=94)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=41, MAX_T=105, N_H_L=126)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=105, MAX_T=105, N_H_L=126)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=105, MAX_T=105, N_H_L=105)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=95, MAX_T=114, N_H_L=43)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=43, MAX_T=114, N_H_L=43)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=43, MAX_T=43, N_H_L=43)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=21, MAX_T=38, N_H_L=42)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=38, MAX_T=38, N_H_L=42)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=38, MAX_T=42, N_H_L=42)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=42, MAX_T=42, N_H_L=42)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=74, MAX_T=20, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=20, MAX_T=20, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=20, MAX_T=15, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=15, MAX_T=20, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=15, MAX_T=15, N_H_L=15)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=104, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=117, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=69, MAX_T=117, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=69, N_H_L=69)
PASSED
attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...)
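The "Trying example" lines above are Hypothesis's verbose-mode output: one line per generated parameter set for the property-based test. As a rough sketch of the pattern that produces this output — the strategies and ranges below are illustrative placeholders, not the actual ones from attention/gqa_test.py:

    import unittest

    import hypothesis.strategies as st
    from hypothesis import Verbosity, given, settings

    class GQATestSketch(unittest.TestCase):
        @given(
            int4_kv=st.booleans(),
            num_groups=st.sampled_from([1, 4]),
            B=st.integers(min_value=1, max_value=128),
            MAX_T=st.integers(min_value=4, max_value=128),
            N_H_L=st.integers(min_value=1, max_value=128),
        )
        @settings(verbosity=Verbosity.verbose, deadline=None)
        def test_gqa(self, int4_kv, num_groups, B, MAX_T, N_H_L):
            # Verbosity.verbose makes Hypothesis print "Trying example: ..."
            # before each run; the real GQA kernel-vs-reference check is
            # elided here.
            self.assertGreater(B * MAX_T * N_H_L, 0)

With a "ci" profile like the one in the session header (derandomize=True, deadline=None), the drawn examples are reproducible across runs.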
2025-05-07T20:30:58.2795835Z 2025-05-07T20:30:58.2796050Z =========================== short test summary info ============================ 2025-05-07T20:30:58.2797105Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/unittest/case.py:153: Skip when CUDA is not available or xformers is not available 2025-05-07T20:30:58.2798381Z ======================== 1 passed, 1 skipped in 40.24s ========================= 2025-05-07T20:30:58.9330533Z 2025-05-07T20:30:58.9331095Z [TEST] Python test suite PASSED: ./attention/gqa_test.py 2025-05-07T20:30:58.9351980Z [TEST] Python test time for ./attention/gqa_test.py: 42 seconds 2025-05-07T20:30:58.9352255Z 2025-05-07T20:30:58.9352265Z 2025-05-07T20:30:58.9352269Z 2025-05-07T20:30:58.9352333Z 2025-05-07T20:30:58.9372707Z ################################################################################ 2025-05-07T20:30:58.9387967Z # [2025-05-07T20:30:58.938Z] Run Python Test Suite: 2025-05-07T20:30:58.9388309Z # ./coalesce/coalesce_test.py 2025-05-07T20:30:58.9388603Z ################################################################################ 2025-05-07T20:30:58.9414362Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py 2025-05-07T20:30:58.9414975Z 2025-05-07T20:31:01.0803012Z ============================= test session starts ============================== 2025-05-07T20:31:01.0803677Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:01.0804205Z cachedir: .pytest_cache 2025-05-07T20:31:01.0804781Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:01.0805514Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:01.0806329Z plugins: hypothesis-6.131.14 2025-05-07T20:31:02.7497889Z collecting ... 
collected 1 item 2025-05-07T20:31:02.7498132Z 2025-05-07T20:31:03.4745707Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED 2025-05-07T20:31:03.4746053Z 2025-05-07T20:31:03.4746217Z ============================== 1 passed in 2.51s =============================== 2025-05-07T20:31:04.1194604Z 2025-05-07T20:31:04.1194888Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py 2025-05-07T20:31:04.1218014Z [TEST] Python test time for ./coalesce/coalesce_test.py: 6 seconds 2025-05-07T20:31:04.1218308Z 2025-05-07T20:31:04.1218313Z 2025-05-07T20:31:04.1218317Z 2025-05-07T20:31:04.1218321Z 2025-05-07T20:31:04.1238404Z ################################################################################ 2025-05-07T20:31:04.1253697Z # [2025-05-07T20:31:04.125Z] Run Python Test Suite: 2025-05-07T20:31:04.1254039Z # ./comm/multi_gpu_car_test.py 2025-05-07T20:31:04.1254583Z ################################################################################ 2025-05-07T20:31:04.1279111Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py 2025-05-07T20:31:04.1279744Z 2025-05-07T20:31:06.2531653Z ============================= test session starts ============================== 2025-05-07T20:31:06.2532384Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:06.2532913Z cachedir: .pytest_cache 2025-05-07T20:31:06.2533495Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:06.2534232Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:06.2534648Z plugins: hypothesis-6.131.14 2025-05-07T20:31:07.9550448Z collecting ... 
collected 5 items 2025-05-07T20:31:07.9550726Z 2025-05-07T20:31:07.9560071Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED 2025-05-07T20:31:07.9578699Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED 2025-05-07T20:31:07.9585817Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED 2025-05-07T20:31:07.9592425Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED 2025-05-07T20:31:07.9608465Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED 2025-05-07T20:31:07.9608930Z 2025-05-07T20:31:07.9609140Z =========================== short test summary info ============================ 2025-05-07T20:31:07.9609836Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:07.9610758Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:07.9611682Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:07.9612591Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:07.9613519Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:07.9614165Z ============================== 5 skipped in 1.82s ============================== 2025-05-07T20:31:08.5245651Z 2025-05-07T20:31:08.5246009Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py 2025-05-07T20:31:08.5267234Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 4 seconds 2025-05-07T20:31:08.5267652Z 2025-05-07T20:31:08.5267658Z 2025-05-07T20:31:08.5267664Z 2025-05-07T20:31:08.5267678Z 2025-05-07T20:31:08.5289013Z ################################################################################ 2025-05-07T20:31:08.5304517Z # [2025-05-07T20:31:08.530Z] Run Python Test Suite: 2025-05-07T20:31:08.5304972Z # ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:08.5305290Z ################################################################################ 2025-05-07T20:31:08.5329529Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:08.5330182Z 2025-05-07T20:31:10.6632463Z ============================= test session starts ============================== 2025-05-07T20:31:10.6633701Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:10.6634730Z cachedir: .pytest_cache 2025-05-07T20:31:10.6635878Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:10.6637829Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:10.6638634Z plugins: hypothesis-6.131.14 2025-05-07T20:31:12.4831122Z collecting ... 
collected 2 items 2025-05-07T20:31:12.4831361Z 2025-05-07T20:31:12.4840158Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED 2025-05-07T20:31:12.4854542Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED 2025-05-07T20:31:12.4855011Z 2025-05-07T20:31:12.4855161Z =========================== short test summary info ============================ 2025-05-07T20:31:12.4855785Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:12.4856618Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:12.4857216Z ============================== 2 skipped in 1.94s ============================== 2025-05-07T20:31:13.0745553Z 2025-05-07T20:31:13.0745953Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:13.0766672Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 5 seconds 2025-05-07T20:31:13.0767137Z 2025-05-07T20:31:13.0767143Z 2025-05-07T20:31:13.0767149Z 2025-05-07T20:31:13.0767154Z 2025-05-07T20:31:13.0788698Z ################################################################################ 2025-05-07T20:31:13.0804430Z # [2025-05-07T20:31:13.080Z] Run Python Test Suite: 2025-05-07T20:31:13.0804905Z # ./kv_cache/kv_cache_test.py 2025-05-07T20:31:13.0805304Z ################################################################################ 2025-05-07T20:31:13.0829499Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py 2025-05-07T20:31:13.0830237Z 2025-05-07T20:31:15.2105293Z ============================= test session starts ============================== 2025-05-07T20:31:15.2106217Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:15.2106744Z cachedir: .pytest_cache 2025-05-07T20:31:15.2107335Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:15.2108094Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:15.2108506Z plugins: hypothesis-6.131.14 2025-05-07T20:31:16.9521015Z collecting ... collected 4 items 2025-05-07T20:31:16.9521249Z 2025-05-07T20:31:19.6839856Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...) 
2025-05-07T20:31:19.6960375Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED 2025-05-07T20:31:19.7104454Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED 2025-05-07T20:31:19.7228969Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED 2025-05-07T20:31:19.7229403Z 2025-05-07T20:31:19.7229557Z =========================== short test summary info ============================ 2025-05-07T20:31:19.7230266Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/unittest/case.py:153: Skip when H100 is not available or MI300 is not available 2025-05-07T20:31:19.7231197Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/unittest/case.py:153: Skip when xformers is not available 2025-05-07T20:31:19.7231805Z ============================== 4 skipped in 4.63s ============================== 2025-05-07T20:31:21.6094861Z 2025-05-07T20:31:21.6095463Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py 2025-05-07T20:31:21.6116138Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 8 seconds 2025-05-07T20:31:21.6116440Z 2025-05-07T20:31:21.6116445Z 2025-05-07T20:31:21.6117510Z 2025-05-07T20:31:21.6117515Z 2025-05-07T20:31:21.6138710Z ################################################################################ 2025-05-07T20:31:21.6154381Z # [2025-05-07T20:31:21.615Z] Run Python Test Suite: 2025-05-07T20:31:21.6154740Z # ./moe/activation_test.py 2025-05-07T20:31:21.6155029Z ################################################################################ 2025-05-07T20:31:21.6179238Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py 2025-05-07T20:31:21.6179849Z 2025-05-07T20:31:23.7526148Z ============================= test session starts ============================== 2025-05-07T20:31:23.7526796Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:23.7527317Z cachedir: .pytest_cache 2025-05-07T20:31:23.7527972Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:23.7528734Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:23.7529139Z plugins: hypothesis-6.131.14 2025-05-07T20:31:25.3786148Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:25.5279640Z collecting ... 
collected 2 items

moe/activation_test.py::ActivationTests::test_silu_mul
Trying example: test_silu_mul(T=1, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=4096, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=4096, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=4096, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=1, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=1, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=4096, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=1, D=7168, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=2048, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=2048, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=2048, D=7168, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=128, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=16384, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=16384, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=128, D=7168, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=1, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=1, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=4096, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=1, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=4096, D=7168, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=4096, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=128, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=1, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=2048, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=2048, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=16384, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=16384, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=4096, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=2048, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=2048, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=16384, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=2048, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=16384, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=16384, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=16384, D=7168, contiguous=True, compiled=False)
PASSED
W0507 20:31:30.927000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
Traceback (most recent call last):
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
    ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
    ttir_module = src.make_ir(options, codegen_fns, module_map, context)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
    generator.visit(fn.parse())
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit
    ret = super().visit(node)
          ^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit
    return visitor(node)
           ^^^^^^^^^^^^^
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit
    self.visit(item)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit
    raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
triton.compiler.errors.CompilationError: at 1:0:
def _fbgemm_silu_mul_quant(
^
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
moe/activation_test.py::ActivationTests::test_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)

self = <activation_test.ActivationTests testMethod=test_silu_mul_quant>
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:31.3987781Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:31.3988110Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:31.3988907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:31.3989664Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:31.3990308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:31.3990998Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:31.3991694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:31.3992416Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:31.3993173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:31.3993931Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:31.3994660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:31.3995293Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:31.3995898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:31.3996420Z fn() 2025-05-07T20:31:31.3996927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:31.3997508Z self.fn.run( 2025-05-07T20:31:31.3997978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:31.3998620Z kernel = self.compile( 2025-05-07T20:31:31.3999157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:31.3999809Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:31.4000207Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:31.4000439Z 2025-05-07T20:31:31.4000651Z self = 2025-05-07T20:31:31.4001734Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:31.4003146Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51f3937380>} 2025-05-07T20:31:31.4004509Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:31.4005548Z context = 2025-05-07T20:31:31.4006175Z 2025-05-07T20:31:31.4006355Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:31.4006881Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:31.4007361Z module_map=module_map) 2025-05-07T20:31:31.4007796Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:31.4008149Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:31.4008427Z E ^ 2025-05-07T20:31:31.4008897Z E ValueError("type fp8e4nv not supported in this architecture. 
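Every failure in this run has the same root cause: the kernels cast their output to fp8e4nv, Triton's name for the float8_e4m3fn ("NV" e4m3) format, and Triton can only lower that cast on NVIDIA GPUs with compute capability 8.9 or newer (Ada and Hopper). On older architectures the compiler offers only fp8e4b15 (e4m3 with bias 15) and fp8e5 (e5m2), exactly as the ValueError reports. A minimal guard sketch for a unittest-style test like the one above; the helper name is ours and the 8.9 threshold is an assumption to verify against your Triton/CUDA stack:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+ (e.g. L4, L40S, H100);
        # pre-Ada GPUs only get fp8e4b15 and fp8e5 in Triton.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Usage on a test method or class:
    # @unittest.skipUnless(supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")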
2025-05-07T20:31:31.4010405Z Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
2025-05-07T20:31:31.7364873Z W0507 20:31:31.733000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
Traceback (most recent call last):
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
    ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
    ttir_module = src.make_ir(options, codegen_fns, module_map, context)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
    generator.visit(fn.parse())
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit
    ret = super().visit(node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit
    return visitor(node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit
    self.visit(item)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit
    raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
triton.compiler.errors.CompilationError: at 1:0:
def _fbgemm_silu_mul_quant(
^
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[The same identify_mutated_tensors warning, with an identical CompilationError traceback for _fbgemm_silu_mul_quant, is logged three more times at 20:31:31.829, 20:31:32.095, and 20:31:32.109.]
2025-05-07T20:31:32.4259108Z self =
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    [test body identical to the @given listing above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51f3756660>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:31:32.4288883Z Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)

self =
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    [test body identical to the @given listing above]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
[jit.py:330 / autotuner.py:186,166 / testing.py:117 / autotuner.py:152 / jit.py:623 / compiler.py:273 frames identical to the first traceback above]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51f38cbce0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
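Both paths die on the same cast: the fused kernel (_fbgemm_silu_mul_quant) and the rowwise quantizer (_kernel_quantize_fp8_row) each try to produce fp8e4nv output. For reference, an eager-PyTorch sketch of the computation under test, mirroring ref_fn above. The scale convention follows the test's own dequantization check (y_fp8.to(torch.float32) * y_scale[:, None]); the scale_ub clamping and the eps floor are assumptions rather than fbgemm's exact semantics, and torch >= 2.1 is assumed for float8_e4m3fn:

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_eager(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
        eps: float = 1e-12,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1, computed in fp32 as in ref_fn above.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1)                    # per-row absolute max
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)   # assumed clamp semantics
        fp8_max = torch.finfo(torch.float8_e4m3fn).max   # 448.0 for e4m3fn
        scale = row_max.clamp(min=eps) / fp8_max         # per-row dequant scale
        y_q = (y / scale[:, None]).clamp(-fp8_max, fp8_max)
        return y_q.to(torch.float8_e4m3fn), scale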
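The W0507 lines interleaved above and below are warnings, not the test failure itself: when torch.compile encounters a user-defined Triton kernel, it first compiles the kernel to Triton IR to work out which tensor arguments the kernel writes to; here that compilation raises the same fp8e4nv error, so it falls back to assuming every input is mutated and logs the traceback. A conceptual sketch of that fallback, with stand-in helper names passed as parameters (the real logic lives in torch/_higher_order_ops/triton_kernel_wrap.py):

    import logging
    from typing import Any, Callable, Dict, List

    import torch

    log = logging.getLogger(__name__)

    def identify_mutated_tensors_sketch(
        compile_to_ttir: Callable[..., Any],
        mutated_args_from_ttir: Callable[[Any], List[str]],
        kernel_kwargs: Dict[str, Any],
    ) -> List[str]:
        try:
            # Build the kernel's Triton IR and analyze which args it stores to.
            ttir = compile_to_ttir(**kernel_kwargs)
            return mutated_args_from_ttir(ttir)
        except Exception:
            # IR generation failed (here: unsupported fp8e4nv cast), so take the
            # conservative answer: every tensor argument may be mutated.
            log.warning(
                "Encountered an exception in identify_mutated_tensors, "
                "assuming every input is mutated",
                exc_info=True,
            )
            return [k for k, v in kernel_kwargs.items() if isinstance(v, torch.Tensor)]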
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:32.4319226Z kernel = self.compile( 2025-05-07T20:31:32.4319779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:32.4320441Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.4320843Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:32.4321076Z 2025-05-07T20:31:32.4321288Z self = 2025-05-07T20:31:32.4322376Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:32.4323913Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51f38cbce0>} 2025-05-07T20:31:32.4325260Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:32.4326289Z context = 2025-05-07T20:31:32.4326578Z 2025-05-07T20:31:32.4326746Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:32.4327276Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.4327815Z module_map=module_map) 2025-05-07T20:31:32.4328191Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.4328558Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:32.4328836Z E ^ 2025-05-07T20:31:32.4329308Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:32.4329757Z 2025-05-07T20:31:32.4330285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:32.4330806Z 2025-05-07T20:31:32.4330915Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:32.4331335Z self=, 2025-05-07T20:31:32.4331744Z T=16384, 2025-05-07T20:31:32.4331943Z D=7168, 2025-05-07T20:31:32.4332150Z scale_ub=1200.0, 2025-05-07T20:31:32.4332384Z contiguous=False, 2025-05-07T20:31:32.4332616Z compiled=False, 2025-05-07T20:31:32.4332830Z ) 2025-05-07T20:31:32.6685442Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:32.6686511Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:31:32.6688217Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:32.6691032Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:32.6692972Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.6695578Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 
2025-05-07T20:31:32.6697962Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.6698950Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.6700180Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:32.6701548Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.6702775Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.6704056Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:32.6705291Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:31:32.6706672Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:32.6707881Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:31:32.6708705Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.6709856Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:32.6710865Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:31:32.6711655Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:32.6712855Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:32.6714136Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:32.6715252Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:32.6716288Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:31:32.6717453Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:32.6718803Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:32.6719867Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.6720773Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:32.6721510Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:32.6722528Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:32.7376234Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:32.7379007Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:31:32.7381648Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:32.7384466Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:32.7386396Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.7388061Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:32.7389442Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.7390535Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.7391760Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:32.7393121Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.7394184Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.7395458Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:32.7396702Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:31:32.7397919Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:32.7399114Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:31:32.7399944Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.7400963Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:32.7401983Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:31:32.7402767Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:32.7403977Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:32.7405335Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:32.7406610Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:32.7407699Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:31:32.7408861Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:32.7410211Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:32.7411271Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.7412179Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:32.7412920Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:32.7414046Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:33.1315051Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:33.1317531Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last):
2025-05-07T20:31:33.1318883Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:33.1320318Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:33.1321303Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:33.1322620Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:33.1324013Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:33.1324997Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:33.1326228Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:33.1327711Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:33.1328787Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:33.1330247Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:33.1331498Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     generator.visit(fn.parse())
2025-05-07T20:31:33.1332725Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:33.1333937Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ret = super().visit(node)
2025-05-07T20:31:33.1334762Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:33.1335795Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit
2025-05-07T20:31:33.1336819Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     return visitor(node)
2025-05-07T20:31:33.1337756Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ^^^^^^^^^^^^^
2025-05-07T20:31:33.1338963Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:33.1340249Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:33.1341372Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit
2025-05-07T20:31:33.1342417Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     self.visit(item)
2025-05-07T20:31:33.1343600Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:33.1344951Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:33.1346016Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:33.1346934Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:33.1347728Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^
2025-05-07T20:31:33.1348746Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:33.9331145Z self =
2025-05-07T20:31:33.9332305Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:31:33.9332618Z
2025-05-07T20:31:33.9332704Z     @given(
2025-05-07T20:31:33.9332949Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:33.9333260Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:33.9333575Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:33.9333909Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:33.9334241Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:33.9334530Z     )
2025-05-07T20:31:33.9334883Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:33.9335331Z     def test_silu_mul_quant(
2025-05-07T20:31:33.9335571Z         self,
2025-05-07T20:31:33.9335768Z         T: int,
2025-05-07T20:31:33.9335968Z         D: int,
2025-05-07T20:31:33.9336183Z         scale_ub: Optional[float],
2025-05-07T20:31:33.9336459Z         contiguous: bool,
2025-05-07T20:31:33.9336706Z         compiled: bool,
2025-05-07T20:31:33.9336929Z     ) -> None:
2025-05-07T20:31:33.9337150Z         torch.manual_seed(2025)
2025-05-07T20:31:33.9337393Z
2025-05-07T20:31:33.9337665Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:33.9338015Z
2025-05-07T20:31:33.9338213Z         x_sign = torch.sign(x)
2025-05-07T20:31:33.9338501Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:33.9338821Z         x = x_sign * x_clamp
2025-05-07T20:31:33.9339064Z         x0 = x[:, :D]
2025-05-07T20:31:33.9339277Z         x1 = x[:, D:]
2025-05-07T20:31:33.9339485Z
2025-05-07T20:31:33.9339672Z         if contiguous:
2025-05-07T20:31:33.9339898Z             x0 = x0.contiguous()
2025-05-07T20:31:33.9340158Z             x1 = x1.contiguous()
2025-05-07T20:31:33.9340401Z
2025-05-07T20:31:33.9340591Z         if scale_ub is not None:
2025-05-07T20:31:33.9340867Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:33.9341214Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:33.9341525Z             )
2025-05-07T20:31:33.9341713Z         else:
2025-05-07T20:31:33.9341928Z             scale_ub_tensor = None
2025-05-07T20:31:33.9342182Z
2025-05-07T20:31:33.9342413Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:33.9342726Z             op = silu_mul_quant
2025-05-07T20:31:33.9342978Z             if compiled:
2025-05-07T20:31:33.9343392Z                 op = torch.compile(op)
2025-05-07T20:31:33.9343689Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:33.9343967Z
2025-05-07T20:31:33.9344154Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:33.9344326Z
2025-05-07T20:31:33.9344429Z moe/activation_test.py:117:
2025-05-07T20:31:33.9344729Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:33.9345067Z moe/activation_test.py:115: in fn
2025-05-07T20:31:33.9345346Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:33.9346051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:33.9346752Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:33.9347288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:31:33.9347975Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:33.9348646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:33.9349181Z     kernel = self.compile(
2025-05-07T20:31:33.9349723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:33.9350381Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:33.9350859Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:33.9351088Z
2025-05-07T20:31:33.9351293Z self =
2025-05-07T20:31:33.9352375Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:33.9354051Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51f3e0c720>}
2025-05-07T20:31:33.9355406Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:33.9356432Z context =
2025-05-07T20:31:33.9356720Z
2025-05-07T20:31:33.9356900Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:33.9357422Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:33.9357914Z                            module_map=module_map)
2025-05-07T20:31:33.9358326Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:33.9358678Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:33.9358948Z E       ^
2025-05-07T20:31:33.9359553Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:33.9360008Z
2025-05-07T20:31:33.9360433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:33.9360948Z
2025-05-07T20:31:33.9361057Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:33.9361476Z     self=,
2025-05-07T20:31:33.9361892Z     T=1,
2025-05-07T20:31:33.9362079Z     D=7168,
2025-05-07T20:31:33.9362287Z     scale_ub=None,
2025-05-07T20:31:33.9362510Z     contiguous=True,
2025-05-07T20:31:33.9362732Z     compiled=True,
2025-05-07T20:31:33.9362950Z )
2025-05-07T20:31:33.9363278Z self =
2025-05-07T20:31:33.9363770Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:33.9364029Z
2025-05-07T20:31:33.9375944Z         y_fp8, y_scale = fn()
2025-05-07T20:31:33.9376235Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:33.9376525Z
2025-05-07T20:31:33.9376768Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:33.9377109Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:33.9377400Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:33.9377721Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:33.9378084Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:33.9378407Z
2025-05-07T20:31:33.9378610Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:33.9378819Z
2025-05-07T20:31:33.9378918Z moe/activation_test.py:126:
2025-05-07T20:31:33.9379220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:33.9379557Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:33.9379885Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:33.9380679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:33.9381436Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:33.9381976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:31:33.9382664Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:33.9383618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:33.9384498Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:33.9385257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in
2025-05-07T20:31:33.9386004Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:33.9386743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:33.9387377Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:33.9387994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:33.9388559Z     fn()
2025-05-07T20:31:33.9389069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:33.9389653Z     self.fn.run(
2025-05-07T20:31:33.9390123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:33.9390657Z     kernel = self.compile(
2025-05-07T20:31:33.9391194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:33.9391980Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:33.9392385Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:33.9392612Z
2025-05-07T20:31:33.9392827Z self =
2025-05-07T20:31:33.9393908Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:33.9395279Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51f3e0e2a0>}
2025-05-07T20:31:33.9396616Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:33.9397636Z context =
2025-05-07T20:31:33.9397921Z
2025-05-07T20:31:33.9398085Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:33.9398604Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:33.9399069Z                            module_map=module_map)
2025-05-07T20:31:33.9399431Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:33.9399785Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:33.9400053Z E       ^
2025-05-07T20:31:33.9400511Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:33.9400956Z
2025-05-07T20:31:33.9401375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:33.9401883Z
2025-05-07T20:31:33.9401987Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:33.9402408Z     self=,
2025-05-07T20:31:33.9402812Z     T=4096,
2025-05-07T20:31:33.9402994Z     D=5120,
2025-05-07T20:31:33.9403187Z     scale_ub=None,
2025-05-07T20:31:33.9403408Z     contiguous=False,
2025-05-07T20:31:33.9403629Z     compiled=False,
2025-05-07T20:31:33.9403839Z )
2025-05-07T20:31:34.2771486Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:34.2773092Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last):
2025-05-07T20:31:34.2774441Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:34.2775865Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:34.2776841Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:34.2778148Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:34.2779534Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:34.2780641Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:34.2781862Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:34.2783237Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:34.2784306Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:34.2785594Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:34.2786837Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     generator.visit(fn.parse())
2025-05-07T20:31:34.2788051Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:34.2789255Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ret = super().visit(node)
2025-05-07T20:31:34.2790089Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:34.2791108Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit
2025-05-07T20:31:34.2792128Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     return visitor(node)
2025-05-07T20:31:34.2792911Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ^^^^^^^^^^^^^
2025-05-07T20:31:34.2794115Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:34.2795487Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:34.2796607Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit
2025-05-07T20:31:34.2797650Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     self.visit(item)
2025-05-07T20:31:34.2798884Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:34.2800241Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:34.2801313Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:34.2802230Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:34.2802965Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^
2025-05-07T20:31:34.2804067Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:36.6220226Z self =
2025-05-07T20:31:36.6220814Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:31:36.6221099Z
2025-05-07T20:31:36.6233360Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:36.6233522Z
2025-05-07T20:31:36.6233628Z moe/activation_test.py:117:
2025-05-07T20:31:36.6233921Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:36.6234257Z moe/activation_test.py:115: in fn
2025-05-07T20:31:36.6234540Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:36.6235239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:36.6235927Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:36.6244391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:31:36.6245092Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:36.6245763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:36.6246318Z     kernel = self.compile(
2025-05-07T20:31:36.6246866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:36.6247598Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:36.6248005Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:36.6248241Z
2025-05-07T20:31:36.6248467Z self =
2025-05-07T20:31:36.6249548Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:36.6250945Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51f2b454e0>}
2025-05-07T20:31:36.6252416Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:36.6253448Z context =
2025-05-07T20:31:36.6253735Z
2025-05-07T20:31:36.6253901Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:36.6254435Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:36.6254909Z                            module_map=module_map)
2025-05-07T20:31:36.6255282Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:36.6255639Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:36.6255910Z E       ^
2025-05-07T20:31:36.6256388Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:36.6256850Z
2025-05-07T20:31:36.6257270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:36.6257791Z
2025-05-07T20:31:36.6257900Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:36.6258325Z     self=,
2025-05-07T20:31:36.6258740Z     T=4096,
2025-05-07T20:31:36.6258933Z     D=7168,
2025-05-07T20:31:36.6259136Z     scale_ub=None,
2025-05-07T20:31:36.6259473Z     contiguous=False,
2025-05-07T20:31:36.6259706Z     compiled=False,
2025-05-07T20:31:36.6259926Z )
2025-05-07T20:31:36.6260254Z self =
2025-05-07T20:31:36.6260751Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:31:36.6261033Z
2025-05-07T20:31:36.6272920Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:36.6273086Z
2025-05-07T20:31:36.6273190Z moe/activation_test.py:117:
2025-05-07T20:31:36.6273504Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:36.6273853Z moe/activation_test.py:115: in fn
2025-05-07T20:31:36.6274140Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:36.6274842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:36.6275546Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:36.6276094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:31:36.6276785Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:36.6277470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:36.6278016Z     kernel = self.compile(
2025-05-07T20:31:36.6278637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:36.6279343Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:36.6279761Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:36.6279993Z
2025-05-07T20:31:36.6280212Z self =
2025-05-07T20:31:36.6281292Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:36.6282672Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51f2b45580>}
2025-05-07T20:31:36.6284030Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:36.6285054Z context =
2025-05-07T20:31:36.6285342Z
2025-05-07T20:31:36.6285513Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:36.6286032Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:36.6286504Z                            module_map=module_map)
2025-05-07T20:31:36.6286882Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:36.6287241Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:36.6287574Z E       ^
2025-05-07T20:31:36.6288053Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:36.6288506Z
2025-05-07T20:31:36.6288927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:36.6289442Z
2025-05-07T20:31:36.6289556Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:36.6289964Z     self=,
2025-05-07T20:31:36.6290377Z     T=128,
2025-05-07T20:31:36.6290573Z     D=7168,
2025-05-07T20:31:36.6290765Z     scale_ub=None,
2025-05-07T20:31:36.6290988Z     contiguous=False,
2025-05-07T20:31:36.6291219Z     compiled=True,
2025-05-07T20:31:36.6291423Z )
2025-05-07T20:31:36.6752260Z self =
2025-05-07T20:31:36.6754023Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:31:36.6754770Z
2025-05-07T20:31:36.6768387Z         y_fp8, y_scale = fn()
2025-05-07T20:31:36.6768675Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:36.6768974Z
2025-05-07T20:31:36.6769211Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:36.6769557Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:36.6769857Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:36.6770177Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:36.6770535Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:36.6770854Z
2025-05-07T20:31:36.6771062Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:36.6771258Z
2025-05-07T20:31:36.6771360Z moe/activation_test.py:126:
2025-05-07T20:31:36.6771666Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:36.6772003Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:36.6772326Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:36.6773118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:36.6773874Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:36.6774419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:31:36.6775191Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:36.6775879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:36.6776602Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:36.6777359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in
2025-05-07T20:31:36.6778100Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:36.6778831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:36.6779471Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:36.6780072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:36.6780593Z     fn()
2025-05-07T20:31:36.6781105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:36.6781690Z     self.fn.run(
2025-05-07T20:31:36.6782151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:36.6782682Z     kernel = self.compile(
2025-05-07T20:31:36.6783301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:36.6783959Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:36.6784351Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:36.6784588Z
2025-05-07T20:31:36.6784796Z self =
2025-05-07T20:31:36.6785888Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:36.6787259Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51f2b46c00>}
2025-05-07T20:31:36.6788652Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:36.6789678Z context =
2025-05-07T20:31:36.6789973Z
2025-05-07T20:31:36.6790140Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:36.6790670Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:36.6791138Z                            module_map=module_map)
2025-05-07T20:31:36.6791506Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:36.6791869Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:36.6792137Z E       ^
2025-05-07T20:31:36.6792605Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:36.6793060Z
2025-05-07T20:31:36.6793486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:36.6793996Z
2025-05-07T20:31:36.6794109Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:36.6794517Z     self=,
2025-05-07T20:31:36.6794925Z     T=128,
2025-05-07T20:31:36.6795124Z     D=7168,
2025-05-07T20:31:36.6795318Z     scale_ub=None,
2025-05-07T20:31:36.6795541Z     contiguous=False,
2025-05-07T20:31:36.6795863Z     compiled=False,
2025-05-07T20:31:36.6796067Z )
2025-05-07T20:31:36.8313387Z self =
2025-05-07T20:31:36.8314905Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:31:36.8315648Z
2025-05-07T20:31:36.8328899Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:36.8329062Z
2025-05-07T20:31:36.8329163Z moe/activation_test.py:117:
2025-05-07T20:31:36.8329464Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:36.8329810Z moe/activation_test.py:115: in fn
2025-05-07T20:31:36.8330094Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:36.8330777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:36.8331467Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:36.8332010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:31:36.8332688Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:36.8333351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:36.8333889Z     kernel = self.compile(
2025-05-07T20:31:36.8334430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:36.8335207Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:36.8335604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:36.8335832Z
2025-05-07T20:31:36.8336043Z self =
2025-05-07T20:31:36.8337118Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:36.8338483Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51ca943100>}
2025-05-07T20:31:36.8339816Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:36.8340843Z context =
2025-05-07T20:31:36.8341128Z
2025-05-07T20:31:36.8341301Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:36.8341820Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:36.8342286Z                            module_map=module_map)
2025-05-07T20:31:36.8342732Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:36.8343087Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:36.8343352Z E       ^
2025-05-07T20:31:36.8343817Z E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:36.8344262Z 2025-05-07T20:31:36.8344682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:36.8345190Z 2025-05-07T20:31:36.8345302Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:36.8345715Z self=, 2025-05-07T20:31:36.8346126Z T=4096, 2025-05-07T20:31:36.8346313Z D=5120, 2025-05-07T20:31:36.8346511Z scale_ub=1200.0, 2025-05-07T20:31:36.8346740Z contiguous=True, 2025-05-07T20:31:36.8346959Z compiled=False, 2025-05-07T20:31:36.8347176Z ) 2025-05-07T20:31:36.8347494Z self = 2025-05-07T20:31:36.8347993Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:36.8348266Z 2025-05-07T20:31:36.8348346Z @given( 2025-05-07T20:31:36.8348577Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:36.8348891Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:36.8349193Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:36.8349524Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:36.8349859Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:36.8350147Z ) 2025-05-07T20:31:36.8350499Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:36.8350942Z def test_silu_mul_quant( 2025-05-07T20:31:36.8351187Z self, 2025-05-07T20:31:36.8351380Z T: int, 2025-05-07T20:31:36.8351585Z D: int, 2025-05-07T20:31:36.8351810Z scale_ub: Optional[float], 2025-05-07T20:31:36.8352080Z contiguous: bool, 2025-05-07T20:31:36.8352331Z compiled: bool, 2025-05-07T20:31:36.8352561Z ) -> None: 2025-05-07T20:31:36.8352771Z torch.manual_seed(2025) 2025-05-07T20:31:36.8353017Z 2025-05-07T20:31:36.8353299Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:36.8353637Z 2025-05-07T20:31:36.8353837Z x_sign = torch.sign(x) 2025-05-07T20:31:36.8354130Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:36.8354549Z x = x_sign * x_clamp 2025-05-07T20:31:36.8354794Z x0 = x[:, :D] 2025-05-07T20:31:36.8355015Z x1 = x[:, D:] 2025-05-07T20:31:36.8355223Z 2025-05-07T20:31:36.8355420Z if contiguous: 2025-05-07T20:31:36.8355653Z x0 = x0.contiguous() 2025-05-07T20:31:36.8355907Z x1 = x1.contiguous() 2025-05-07T20:31:36.8356148Z 2025-05-07T20:31:36.8356348Z if scale_ub is not None: 2025-05-07T20:31:36.8356614Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:36.8356957Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:36.8357273Z ) 2025-05-07T20:31:36.8357472Z else: 2025-05-07T20:31:36.8357681Z scale_ub_tensor = None 2025-05-07T20:31:36.8357929Z 2025-05-07T20:31:36.8358159Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.8358466Z op = silu_mul_quant 2025-05-07T20:31:36.8358718Z if compiled: 2025-05-07T20:31:36.8358972Z op = torch.compile(op) 2025-05-07T20:31:36.8359260Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.8359533Z 2025-05-07T20:31:36.8359722Z > y_fp8, y_scale = fn() 2025-05-07T20:31:36.8359883Z 2025-05-07T20:31:36.8359983Z moe/activation_test.py:117: 2025-05-07T20:31:36.8360274Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.8360601Z moe/activation_test.py:115: in fn 2025-05-07T20:31:36.8360882Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.8361648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:36.8362341Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:36.8362873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:36.8363550Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:36.8364215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:36.8364745Z kernel = self.compile( 2025-05-07T20:31:36.8365280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:36.8365925Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:36.8366325Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.8366552Z 2025-05-07T20:31:36.8366762Z self = 2025-05-07T20:31:36.8367881Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:36.8369291Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51cb66b560>} 2025-05-07T20:31:36.8370628Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:36.8371645Z context = 2025-05-07T20:31:36.8371930Z 2025-05-07T20:31:36.8372104Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:36.8372619Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:36.8373082Z module_map=module_map) 2025-05-07T20:31:36.8373447Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:36.8373803Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:36.8374059Z E ^ 2025-05-07T20:31:36.8374613Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:36.8375066Z 2025-05-07T20:31:36.8375481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:36.8375987Z 2025-05-07T20:31:36.8376098Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:36.8376506Z self=, 2025-05-07T20:31:36.8376914Z T=1, 2025-05-07T20:31:36.8377107Z D=5120, 2025-05-07T20:31:36.8377296Z scale_ub=None, 2025-05-07T20:31:36.8377508Z contiguous=True, 2025-05-07T20:31:36.8377732Z compiled=True, 2025-05-07T20:31:36.8377929Z ) 2025-05-07T20:31:37.1786711Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:37.1787926Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:31:37.1789279Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:37.1790881Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:37.1791867Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:37.1793168Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:37.1794545Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:37.1795528Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:37.1796765Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:37.1798134Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:37.1799191Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:37.1800467Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:37.1801711Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:31:37.1802921Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:37.1804117Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:31:37.1805070Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:37.1806369Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:37.1807381Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:31:37.1808246Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^ 2025-05-07T20:31:37.1809448Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:37.1810727Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:37.1811840Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:37.1812882Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:31:37.1814179Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:37.1815533Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:37.1816588Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:37.1817501Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:37.1818240Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:31:37.1819425Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:37.7366949Z self = 2025-05-07T20:31:37.7367788Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:37.7368164Z 2025-05-07T20:31:37.7368285Z @given( 2025-05-07T20:31:37.7368595Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:37.7369046Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:37.7369375Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:37.7369714Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:37.7370040Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:37.7370334Z ) 2025-05-07T20:31:37.7370789Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:37.7371256Z def test_silu_mul_quant( 2025-05-07T20:31:37.7371506Z self, 2025-05-07T20:31:37.7371710Z T: int, 2025-05-07T20:31:37.7371912Z D: int, 2025-05-07T20:31:37.7372138Z scale_ub: Optional[float], 2025-05-07T20:31:37.7372415Z contiguous: bool, 2025-05-07T20:31:37.7372655Z compiled: bool, 2025-05-07T20:31:37.7372887Z ) -> None: 2025-05-07T20:31:37.7373109Z torch.manual_seed(2025) 2025-05-07T20:31:37.7373353Z 2025-05-07T20:31:37.7373826Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:37.7374182Z 2025-05-07T20:31:37.7374390Z x_sign = torch.sign(x) 2025-05-07T20:31:37.7374681Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:37.7374996Z x = x_sign * x_clamp 2025-05-07T20:31:37.7375244Z x0 = x[:, :D] 2025-05-07T20:31:37.7375464Z x1 = x[:, D:] 2025-05-07T20:31:37.7375677Z 2025-05-07T20:31:37.7375869Z if contiguous: 2025-05-07T20:31:37.7376108Z x0 = x0.contiguous() 2025-05-07T20:31:37.7376374Z x1 = x1.contiguous() 2025-05-07T20:31:37.7376619Z 2025-05-07T20:31:37.7376815Z if scale_ub is not None: 2025-05-07T20:31:37.7377096Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:37.7377456Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:37.7377770Z ) 2025-05-07T20:31:37.7377964Z else: 2025-05-07T20:31:37.7378182Z scale_ub_tensor = None 2025-05-07T20:31:37.7378446Z 2025-05-07T20:31:37.7378681Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:37.7379003Z op = silu_mul_quant 2025-05-07T20:31:37.7379260Z if compiled: 2025-05-07T20:31:37.7379508Z op = torch.compile(op) 2025-05-07T20:31:37.7379819Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.7380101Z 2025-05-07T20:31:37.7380307Z y_fp8, y_scale = fn() 2025-05-07T20:31:37.7380596Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:37.7380899Z 2025-05-07T20:31:37.7381142Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:37.7381479Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:37.7381779Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:37.7382100Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:37.7382460Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:37.7382776Z 2025-05-07T20:31:37.7382991Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:37.7383185Z 2025-05-07T20:31:37.7383291Z moe/activation_test.py:126: 2025-05-07T20:31:37.7383586Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.7383922Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:37.7384249Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:37.7385036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 
2025-05-07T20:31:37.7385980Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:37.7386528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:37.7387207Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:37.7387895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:37.7388619Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:37.7389370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:37.7390110Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:37.7390838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:37.7391483Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:37.7392081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:37.7392591Z fn() 2025-05-07T20:31:37.7393096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:37.7393759Z self.fn.run( 2025-05-07T20:31:37.7394230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:37.7394756Z kernel = self.compile( 2025-05-07T20:31:37.7395297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:37.7395949Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:37.7396336Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.7396575Z 2025-05-07T20:31:37.7396782Z self = 2025-05-07T20:31:37.7397860Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:37.7399239Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51cbd077e0>} 2025-05-07T20:31:37.7400583Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:37.7401605Z context = 2025-05-07T20:31:37.7401904Z 2025-05-07T20:31:37.7402068Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:37.7402586Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:37.7403054Z module_map=module_map) 2025-05-07T20:31:37.7403416Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:37.7403775Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:37.7404044Z E ^ 2025-05-07T20:31:37.7404508Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:37.7404964Z 2025-05-07T20:31:37.7405376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:37.7406151Z 2025-05-07T20:31:37.7406259Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:37.7406675Z self=, 2025-05-07T20:31:37.7407215Z T=2048, 2025-05-07T20:31:37.7407407Z D=5120, 2025-05-07T20:31:37.7407657Z scale_ub=None, 2025-05-07T20:31:37.7407867Z contiguous=True, 2025-05-07T20:31:37.7408092Z compiled=True, 2025-05-07T20:31:37.7408300Z ) 2025-05-07T20:31:38.0540782Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:38.0542077Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:31:38.0543409Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:38.0544818Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:38.0545797Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:38.0547256Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:38.0548629Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:38.0549603Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:38.0550815Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:38.0552172Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:38.0553225Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:38.0554492Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:38.0555728Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:31:38.0556940Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:38.0558134Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:38.0558953Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:38.0559968Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:38.0560975Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:31:38.0561874Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^ 2025-05-07T20:31:38.0563072Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:38.0564363Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:38.0565466Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:38.0566489Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:31:38.0567800Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:38.0569203Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:38.0570390Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:38.0571295Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:38.0572023Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:31:38.0573034Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:38.1385062Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:38.1386323Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:31:38.1387666Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:38.1389076Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:38.1390047Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:38.1391345Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:38.1392722Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:38.1393698Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:38.1394921Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:38.1396453Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:38.1397512Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:38.1398784Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:38.1400020Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:31:38.1401231Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:38.1402430Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:38.1403508Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:38.1404531Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:38.1405542Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:31:38.1406624Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^ 2025-05-07T20:31:38.1407883Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:38.1409157Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:38.1410272Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:38.1411304Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:31:38.1412469Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:38.1413824Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:38.1414879Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:38.1415789Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:38.1416524Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:31:38.1417532Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[... the identify_mutated_tensors warning and CompilationError traceback above repeats twice more, verbatim except for timestamps (W0507 20:31:38.399000, W0507 20:31:38.413000); omitted ...]
2025-05-07T20:31:38.7756616Z self = 
2025-05-07T20:31:38.7757299Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:38.7757587Z 
2025-05-07T20:31:38.7757671Z     @given(
2025-05-07T20:31:38.7758129Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:38.7758455Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:38.7758767Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:38.7759104Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:38.7759444Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:38.7759732Z     )
2025-05-07T20:31:38.7760095Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:38.7760549Z     def test_silu_mul_quant(
2025-05-07T20:31:38.7767950Z         self,
2025-05-07T20:31:38.7768168Z         T: int,
2025-05-07T20:31:38.7768380Z         D: int,
2025-05-07T20:31:38.7768604Z         scale_ub: Optional[float],
2025-05-07T20:31:38.7768889Z         contiguous: bool,
2025-05-07T20:31:38.7769139Z         compiled: bool,
2025-05-07T20:31:38.7769368Z     ) -> None:
2025-05-07T20:31:38.7769590Z         torch.manual_seed(2025)
2025-05-07T20:31:38.7769855Z 
2025-05-07T20:31:38.7770131Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:38.7770488Z 
2025-05-07T20:31:38.7770693Z         x_sign = torch.sign(x)
2025-05-07T20:31:38.7770987Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:38.7771309Z         x = x_sign * x_clamp
2025-05-07T20:31:38.7771559Z         x0 = x[:, :D]
2025-05-07T20:31:38.7771784Z         x1 = x[:, D:]
2025-05-07T20:31:38.7772003Z 
2025-05-07T20:31:38.7772199Z         if contiguous:
2025-05-07T20:31:38.7772589Z             x0 = x0.contiguous()
2025-05-07T20:31:38.7772863Z             x1 = x1.contiguous()
2025-05-07T20:31:38.7773112Z 
2025-05-07T20:31:38.7773315Z         if scale_ub is not None:
2025-05-07T20:31:38.7773593Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:38.7773939Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:38.7774258Z             )
2025-05-07T20:31:38.7774460Z         else:
2025-05-07T20:31:38.7774693Z             scale_ub_tensor = None
2025-05-07T20:31:38.7774954Z 
2025-05-07T20:31:38.7775188Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:38.7775518Z             op = silu_mul_quant
2025-05-07T20:31:38.7775780Z             if compiled:
2025-05-07T20:31:38.7776032Z                 op = torch.compile(op)
2025-05-07T20:31:38.7776337Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:38.7776619Z 
2025-05-07T20:31:38.7776815Z         y_fp8, y_scale = fn()
2025-05-07T20:31:38.7777113Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:38.7777409Z 
2025-05-07T20:31:38.7777648Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:38.7777990Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:38.7778290Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:38.7778609Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:38.7778966Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:38.7779288Z 
2025-05-07T20:31:38.7779494Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:38.7779691Z 
2025-05-07T20:31:38.7779796Z moe/activation_test.py:126: 
2025-05-07T20:31:38.7780099Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:38.7780444Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:38.7780772Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:38.7781572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:38.7782335Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:38.7782891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:31:38.7783573Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:38.7784264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:38.7785083Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:38.7785839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 
2025-05-07T20:31:38.7786585Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:38.7787325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:38.7787977Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:38.7788588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:38.7789155Z     fn()
2025-05-07T20:31:38.7789676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:38.7790267Z     self.fn.run(
2025-05-07T20:31:38.7790735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:38.7791271Z     kernel = self.compile(
2025-05-07T20:31:38.7791818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:38.7792487Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:38.7792968Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:38.7793206Z 
2025-05-07T20:31:38.7793417Z self = 
2025-05-07T20:31:38.7794512Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:38.7796053Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51cbc36c00>}
2025-05-07T20:31:38.7797405Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:38.7798443Z context = 
2025-05-07T20:31:38.7798738Z 
2025-05-07T20:31:38.7798908Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:38.7799435Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:38.7799904Z                            module_map=module_map)
2025-05-07T20:31:38.7800276Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:38.7800641Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:38.7800918Z E       ^
2025-05-07T20:31:38.7801390Z E       ValueError("type fp8e4nv not supported in this architecture. 
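For context on what the failing reference path computes: ref_fn applies SiLU (x * sigmoid(x)) to x0, multiplies by x1, and then quantizes row-wise to FP8. A plain-PyTorch approximation of that quantization step is sketched below; it is inferred from how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]) and is not FBGEMM's exact _kernel_quantize_fp8_row semantics.

```python
# Hedged sketch of row-wise FP8 quantization; the real Triton kernel in
# fp8_gemm.py differs in tiling, autotuning, and edge-case handling.
from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    row_max = y.abs().amax(dim=-1).clamp(min=1e-12)  # per-row max magnitude
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)   # cap the scale if given
    scale = row_max / FP8_MAX
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale  # dequantize via y_fp8.float() * scale[:, None]
```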
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:38.7801851Z 
2025-05-07T20:31:38.7802272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:38.7802787Z 
2025-05-07T20:31:38.7802901Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:38.7803320Z     self=,
2025-05-07T20:31:38.7803731Z     T=128,
2025-05-07T20:31:38.7803930Z     D=5120,
2025-05-07T20:31:38.7804122Z     scale_ub=None,
2025-05-07T20:31:38.7804348Z     contiguous=True,
2025-05-07T20:31:38.7804576Z     compiled=True,
2025-05-07T20:31:38.7804786Z )
[... the identify_mutated_tensors warning and CompilationError traceback above repeats four times for this example, verbatim except for timestamps (W0507 20:31:39.097000, 20:31:39.182000, 20:31:39.448000, 20:31:39.463000); omitted ...]
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:39.1015073Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:31:39.1015897Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.1016914Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:39.1017927Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:31:39.1018710Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^ 2025-05-07T20:31:39.1019912Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:39.1021309Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:39.1022417Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:39.1023455Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:31:39.1024618Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:39.1025966Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:39.1027025Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.1027928Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.1028655Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:31:39.1029746Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.1849576Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:39.1850817Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:31:39.1852157Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:39.1853565Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:39.1854537Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.1855836Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:39.1857215Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.1858195Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.1859474Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:39.1860836Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.1861891Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.1863330Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:39.1864562Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:31:39.1865778Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:39.1866964Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:31:39.1867783Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.1868801Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:39.1869805Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:31:39.1870697Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^ 2025-05-07T20:31:39.1871890Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:39.1873159Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:39.1874266Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:39.1875293Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:31:39.1876457Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:39.1877794Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:39.1878842Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.1879747Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.1880477Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:31:39.1881484Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.4512302Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:39.4513651Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:31:39.4514983Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:39.4516576Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:39.4517550Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.4518841Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:39.4520215Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.4521192Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.4522405Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:39.4523876Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.4524923Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.4526197Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:39.4527444Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:31:39.4528729Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:39.4529937Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:31:39.4530754Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.4531773Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:39.4532788Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:31:39.4533575Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^ 2025-05-07T20:31:39.4534781Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:39.4536047Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:39.4537157Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:39.4538303Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:31:39.4539473Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:39.4540874Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:39.4541923Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.4542825Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.4543563Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:31:39.4544582Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.4655087Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:39.4656342Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:31:39.4657680Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:39.4659124Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:39.4660125Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.4661435Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:39.4662805Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.4663795Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.4665025Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:39.4666398Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.4667457Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.4668733Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:39.4670153Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:31:39.4671375Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:39.4672589Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:31:39.4673415Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.4674431Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:39.4675456Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:31:39.4676250Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^ 2025-05-07T20:31:39.4677532Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:39.4678811Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:39.4679931Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:39.4680973Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:31:39.4682158Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:39.4683515Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:39.4684573Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.4685482Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.4686222Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:31:39.4687247Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:39.6935601Z self = 
2025-05-07T20:31:39.6936359Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... identical test source listing and CompilationError traceback as for the T = 2048 example above; omitted ...]
2025-05-07T20:31:39.6973703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
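The same failure can be reproduced outside the Hypothesis loop by driving the reference path directly. The import path and call signature below are taken from the traceback above; the surrounding scaffolding is an illustrative assumption.

```python
# Standalone repro sketch for the T=128 example; on an sm_86 GPU this
# raises the same CompilationError from inside triton_quantize_fp8_row.
import torch
from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import (
    triton_quantize_fp8_row,
)

T, D = 128, 5120
torch.manual_seed(2025)
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()

# Same math as ref_fn in moe/activation_test.py.
y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
y_fp8, y_scale = triton_quantize_fp8_row(y, None)  # CompilationError on pre-sm_89
```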
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.6973285Z 2025-05-07T20:31:39.6973703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:39.6974214Z 2025-05-07T20:31:39.6974318Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:39.6974736Z self=, 2025-05-07T20:31:39.6975143Z T=4096, 2025-05-07T20:31:39.6975332Z D=5120, 2025-05-07T20:31:39.6975523Z scale_ub=None, 2025-05-07T20:31:39.6975739Z contiguous=True, 2025-05-07T20:31:39.6975963Z compiled=True, 2025-05-07T20:31:39.6976163Z ) 2025-05-07T20:31:40.0199993Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:40.0201545Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:40.0203123Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:40.0204751Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:40.0206005Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:40.0207311Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:40.0208746Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:40.0209719Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:40.0210936Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:40.0212420Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:40.0213482Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:40.0214750Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:40.0215993Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:40.0217194Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:40.0218396Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:40.0219211Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:40.0220223Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:40.0221232Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:31:40.0222008Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:31:40.0223209Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:40.0224477Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:40.0225581Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:40.0226605Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:31:40.0227889Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:40.0229235Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:40.0230332Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:40.0231232Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:40.0231956Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:31:40.0232971Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:40.1058086Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:40.1059551Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:40.1060884Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:40.1062301Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:40.1063282Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:40.1064587Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:40.1065956Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:40.1066929Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:40.1068149Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:40.1069515Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:40.1070575Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:40.1071843Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:40.1073072Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:40.1074401Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:40.1075599Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:40.1076428Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:40.1077447Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:40.1078454Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:31:40.1079245Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:31:40.1094101Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:40.1095594Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:40.1096758Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:40.1097843Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:31:40.1099085Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:40.1100469Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:40.1101555Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:40.1102479Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:40.1103232Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:31:40.1104259Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:40.3711257Z W0507 20:31:40.368000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:40.3853004Z W0507 20:31:40.383000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[each of these two warnings was followed by a verbatim repeat of the traceback above, ending in the same CompilationError for _fbgemm_silu_mul_quant; repeated stack traces omitted]
2025-05-07T20:31:40.6222320Z self = 
2025-05-07T20:31:40.6223768Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:40.6224545Z 
2025-05-07T20:31:40.6224767Z     @given(
2025-05-07T20:31:40.6225251Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:40.6225882Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:40.6226498Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:40.6227171Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:40.6227822Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:40.6228404Z     )
2025-05-07T20:31:40.6229126Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:40.6229671Z     def test_silu_mul_quant(
2025-05-07T20:31:40.6229927Z         self,
2025-05-07T20:31:40.6230140Z         T: int,
2025-05-07T20:31:40.6230339Z         D: int,
2025-05-07T20:31:40.6230574Z         scale_ub: Optional[float],
2025-05-07T20:31:40.6230856Z         contiguous: bool,
2025-05-07T20:31:40.6231105Z         compiled: bool,
2025-05-07T20:31:40.6231333Z     ) -> None:
2025-05-07T20:31:40.6231762Z         torch.manual_seed(2025)
2025-05-07T20:31:40.6232012Z 
2025-05-07T20:31:40.6232291Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:40.6232646Z 
2025-05-07T20:31:40.6232853Z         x_sign = torch.sign(x)
2025-05-07T20:31:40.6233146Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:40.6233466Z         x = x_sign * x_clamp
2025-05-07T20:31:40.6233717Z         x0 = x[:, :D]
2025-05-07T20:31:40.6233936Z         x1 = x[:, D:]
2025-05-07T20:31:40.6234163Z 
2025-05-07T20:31:40.6234364Z         if contiguous:
2025-05-07T20:31:40.6234603Z             x0 = x0.contiguous()
2025-05-07T20:31:40.6234880Z             x1 = x1.contiguous()
2025-05-07T20:31:40.6235131Z 
2025-05-07T20:31:40.6235329Z         if scale_ub is not None:
2025-05-07T20:31:40.6235614Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:40.6235972Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:40.6236298Z             )
2025-05-07T20:31:40.6236514Z         else:
2025-05-07T20:31:40.6236726Z             scale_ub_tensor = None
2025-05-07T20:31:40.6236984Z 
2025-05-07T20:31:40.6237225Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:40.6237541Z             op = silu_mul_quant
2025-05-07T20:31:40.6237798Z             if compiled:
2025-05-07T20:31:40.6238050Z                 op = torch.compile(op)
2025-05-07T20:31:40.6238350Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:40.6238632Z 
2025-05-07T20:31:40.6238954Z         y_fp8, y_scale = fn()
2025-05-07T20:31:40.6239244Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:40.6239543Z 
2025-05-07T20:31:40.6239808Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:40.6240174Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:40.6240483Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:40.6240805Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:40.6241167Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:40.6241481Z 
2025-05-07T20:31:40.6241690Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:40.6241888Z 
2025-05-07T20:31:40.6241999Z moe/activation_test.py:126: 
2025-05-07T20:31:40.6242301Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:40.6242645Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:40.6242990Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:40.6243778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:40.6244543Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:40.6245099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:40.6245788Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:40.6246476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:40.6247205Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:40.6248050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:40.6248805Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:40.6249533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:40.6250177Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:40.6250782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:40.6251306Z     fn()
2025-05-07T20:31:40.6251812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:40.6252485Z     self.fn.run(
2025-05-07T20:31:40.6252958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:40.6253488Z     kernel = self.compile(
2025-05-07T20:31:40.6254033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:40.6254698Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:40.6255102Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:40.6255332Z 
2025-05-07T20:31:40.6255538Z self = 
2025-05-07T20:31:40.6256620Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:40.6258004Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c9bea5c0>}
2025-05-07T20:31:40.6259345Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:40.6260441Z context = 
2025-05-07T20:31:40.6260739Z 
2025-05-07T20:31:40.6260909Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:40.6261440Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:40.6261910Z                            module_map=module_map)
2025-05-07T20:31:40.6262275Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:40.6262646Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:40.6262923Z E       ^
2025-05-07T20:31:40.6263385Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:40.6263841Z 
2025-05-07T20:31:40.6264257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:40.6264769Z 
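[editor's note: for readers tracing the failing reference path, the sketch below re-states in plain PyTorch what the row-wise fp8 quantization in triton_quantize_fp8_row computes, under stated assumptions: 448.0 is the largest finite float8_e4m3fn value, scale_ub (when given) caps the per-row maximum, and the returned scale is the reciprocal used for dequantization, matching y_fp8.to(torch.float32) * y_scale[:, None] in the test. This illustrates the semantics only; it is not FBGEMM's implementation:]

import torch

E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_fp8_row_sketch(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
    row_max = y.abs().amax(dim=1).float()            # per-row absolute maximum
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)   # cap the dynamic range
    scale = E4M3_MAX / row_max.clamp(min=1e-12)      # per-row quantization scale
    y_fp8 = (y.float() * scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale.reciprocal()                 # dequantize: y_fp8 * recip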
2025-05-07T20:31:40.6264883Z Trying example: test_silu_mul_quant( self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True, )
2025-05-07T20:31:40.6531833Z W0507 20:31:40.652000 86821 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:31:40.6534751Z W0507 20:31:40.652000 86821 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:31:40.6537398Z W0507 20:31:40.652000 86821 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:31:40.6539377Z W0507 20:31:40.652000 86821 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:31:40.6540740Z W0507 20:31:40.652000 86821 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
[this example then failed exactly as the one above — in ref_fn() at moe/activation_test.py:126, via triton_quantize_fp8_row and _kernel_quantize_fp8_row, with the identical test listing, traceback, and fp8e4nv CompilationError; repeated output omitted]
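[editor's note: the recompile_limit warning above is a secondary symptom: every hypothesis example hands torch.compile tensors with new shapes or strides, and after 8 guard misses dynamo stops recompiling. Assuming the shape/stride variation is the only trigger, one hedged fix in a test like this is to compile once with dynamic shapes; the import path is inferred from the traceback and may differ:]

import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant  # path inferred from the log

# Compile a single shape/stride-agnostic graph instead of specializing
# per (T, D, stride) combination, which is what exhausts the limit here.
op = torch.compile(silu_mul_quant, dynamic=True)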
2025-05-07T20:31:40.7233400Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True, )
[failed in fn() at moe/activation_test.py:117, via silu_mul_quant (activation.py:80) and _fbgemm_silu_mul_quant; same fp8e4nv CompilationError, repeated output omitted]
2025-05-07T20:31:41.0215797Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True, )
[failed in ref_fn() at moe/activation_test.py:126, via triton_quantize_fp8_row and _kernel_quantize_fp8_row; repeated output omitted]
2025-05-07T20:31:41.0741149Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False, )
[failed in fn() via _fbgemm_silu_mul_quant; repeated output omitted]
2025-05-07T20:31:41.1924675Z Trying example: test_silu_mul_quant( self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True, )
[failed in fn() via _fbgemm_silu_mul_quant; repeated output omitted]
2025-05-07T20:31:41.1956460Z Trying example: test_silu_mul_quant( self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False, )
[failed in fn() via _fbgemm_silu_mul_quant; repeated output omitted]
2025-05-07T20:31:41.2900670Z Trying example: test_silu_mul_quant( self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False, )
[failed in fn() via _fbgemm_silu_mul_quant; its traceback ends:]
2025-05-07T20:31:41.2898085Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:41.2898443Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:41.2898711Z E       ^
2025-05-07T20:31:41.2899179Z E       ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
[test body and CompilationError traceback identical to the example above; repeated block elided]
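Every Hypothesis example fails at the same point: the kernels ask Triton for the fp8e4nv type (torch.float8_e4m3fn), which Triton only provides on GPUs with compute capability 8.9 or newer (Ada/Hopper). The linux.g5.4xlarge runner carries an A10G, which reports sm_86, where only fp8e4b15 and fp8e5 exist, hence the repeated CompilationError. A minimal sketch of a capability guard a test harness could use to skip these cases on unsupported hardware (the helper names here are illustrative, not part of FBGEMM's API):

    import pytest
    import torch

    def fp8e4nv_supported() -> bool:
        # fp8e4nv corresponds to torch.float8_e4m3fn; Triton emits it only
        # for compute capability >= (8, 9). The A10G in a g5.4xlarge
        # reports (8, 6), so this returns False on this runner.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    requires_fp8e4nv = pytest.mark.skipif(
        not fp8e4nv_supported(),
        reason="fp8e4nv requires sm_89+ (Ada/Hopper)",
    )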
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
[same CompilationError: "type fp8e4nv not supported in this architecture"; with compiled=True the traceback additionally passes through torch/_dynamo/eval_frame.py:678 in _fn before reaching activation.py:80]

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[same CompilationError; repeated block elided]

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = 
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

[test body as above; this time fn() itself succeeds and the failure moves to the reference path]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref,
y_scale_ref = ref_fn() 2025-05-07T20:31:41.7916784Z 2025-05-07T20:31:41.7916883Z moe/activation_test.py:126: 2025-05-07T20:31:41.7917180Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.7917507Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:41.7917831Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.7918612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:41.7919361Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:41.7919895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.7920576Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.7921259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:41.7921976Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.7922838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:41.7923623Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.7924639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:41.7925363Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:41.7925962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:41.7926484Z fn() 2025-05-07T20:31:41.7926990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:41.7935186Z self.fn.run( 2025-05-07T20:31:41.7935710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.7936266Z kernel = self.compile( 2025-05-07T20:31:41.7936823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.7937480Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.7937887Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.7938127Z 2025-05-07T20:31:41.7938449Z self = 2025-05-07T20:31:41.7939543Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.7941788Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f51c8d36c00>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[test body and CompilationError traceback identical to the earlier compiled=True examples; repeated block elided]
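The reference path fails the same way because _kernel_quantize_fp8_row, reached via triton_quantize_fp8_row in fp8_gemm.py:2370, also asks Triton for fp8e4nv. A rough pure-PyTorch sketch of the row-wise quantization the reference performs, usable as a fallback on pre-sm_89 GPUs (a hypothetical helper that approximates FBGEMM's kernel rather than reproducing it):

    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_torch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row scale: row_max / FP8_MAX, optionally clamped by scale_ub,
        # with a floor to avoid dividing by zero on all-zero rows.
        row_max = y.abs().amax(dim=-1, keepdim=True).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        # Clamp before the cast so out-of-range values saturate explicitly.
        y_scaled = torch.clamp(y.float() / scale, min=-FP8_MAX, max=FP8_MAX)
        return y_scaled.to(torch.float8_e4m3fn), scale.squeeze(-1)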
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
[same CompilationError in _fbgemm_silu_mul_quant; repeated block elided]

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[same CompilationError; repeated block elided]

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[same CompilationError; repeated block elided]
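The failure is independent of T, D, scale_ub, contiguity, and compilation mode, which a standalone call confirms. A minimal repro sketch outside the Hypothesis harness (the script and error handling are illustrative; silu_mul_quant, its module path, and its argument order are taken from the traceback above):

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    D = 5120
    x = torch.randn([128, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D], x[:, D:]
    try:
        y_fp8, y_scale = silu_mul_quant(x0, x1, None)  # scale_ub=None
        print("ok:", y_fp8.shape, y_scale.shape)
    except Exception as e:  # CompilationError on pre-sm_89 GPUs such as the A10G
        print("failed on", torch.cuda.get_device_name(), "->", e)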
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[same CompilationError; repeated block elided]

Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
[test body and traceback identical to the example above, ending in the same error]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:42.2848204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
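Every one of these failures happens before the kernel ever runs: Triton rejects the fp8e4nv dtype while lowering _fbgemm_silu_mul_quant. The job ran on a linux.g5.4xlarge runner, whose NVIDIA A10G reports CUDA compute capability 8.6 (sm_86); this Triton build only emits fp8e4nv (float8_e4m3fn) conversions on sm_89 and newer, which is why the error offers only 'fp8e4b15' and 'fp8e5' as alternatives. Below is a minimal sketch of a capability guard that would skip rather than fail these examples on such runners (supports_fp8e4nv is a hypothetical helper, not part of the FBGEMM test suite):

import torch

def supports_fp8e4nv() -> bool:
    # Hypothetical guard: Triton's fp8e4nv conversions need SM 8.9+
    # (Ada/Hopper). The A10G on linux.g5.4xlarge reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Usage sketch: decorate the Hypothesis test so unsupported GPUs skip it.
# @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
# def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled): ...

With a guard along these lines, sm_86 runners would report the test as skipped instead of burning through every Hypothesis example to the same CompilationError.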
Each of the remaining Hypothesis examples reproduces the same test body, the same traceback through fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (silu_mul_quant calling _fbgemm_silu_mul_quant[grid]) and triton/runtime/jit.py into triton/compiler/compiler.py, and the identical CompilationError; runs with compiled=True only add a torch/_dynamo/eval_frame.py:678 frame before the same failure. The examples tried:

2025-05-07T20:31:42.2848819Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:31:42.4226636Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:42.5396471Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:42.5435316Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:42.6335761Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:42.6369762Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:31:42.6400729Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:43.0240673Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:43.0278731Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:43.0982410Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:43.2273976Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)

The last of these ends, like every run before it, with:

2025-05-07T20:31:43.2303125Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:43.2303477Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:31:43.2303734Z E ^
2025-05-07T20:31:43.2304200Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.2304653Z 2025-05-07T20:31:43.2305068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.2305573Z 2025-05-07T20:31:43.2305850Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.2306261Z self=, 2025-05-07T20:31:43.2306664Z T=16384, 2025-05-07T20:31:43.2306864Z D=5120, 2025-05-07T20:31:43.2307054Z scale_ub=1200.0, 2025-05-07T20:31:43.2307275Z contiguous=True, 2025-05-07T20:31:43.2307496Z compiled=True, 2025-05-07T20:31:43.2307695Z ) 2025-05-07T20:31:43.2308012Z self = 2025-05-07T20:31:43.2308511Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:43.2308784Z 2025-05-07T20:31:43.2308867Z @given( 2025-05-07T20:31:43.2309218Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.2309536Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.2309843Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.2310165Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.2310492Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.2310782Z ) 2025-05-07T20:31:43.2311126Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.2311571Z def test_silu_mul_quant( 2025-05-07T20:31:43.2311814Z self, 2025-05-07T20:31:43.2312007Z T: int, 2025-05-07T20:31:43.2312207Z D: int, 2025-05-07T20:31:43.2312427Z scale_ub: Optional[float], 2025-05-07T20:31:43.2312698Z contiguous: bool, 2025-05-07T20:31:43.2312934Z compiled: bool, 2025-05-07T20:31:43.2313157Z ) -> None: 2025-05-07T20:31:43.2313375Z torch.manual_seed(2025) 2025-05-07T20:31:43.2313613Z 2025-05-07T20:31:43.2313891Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.2314238Z 2025-05-07T20:31:43.2314428Z x_sign = torch.sign(x) 2025-05-07T20:31:43.2314716Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.2315034Z x = x_sign * x_clamp 2025-05-07T20:31:43.2315270Z x0 = x[:, :D] 2025-05-07T20:31:43.2315485Z x1 = x[:, D:] 2025-05-07T20:31:43.2315696Z 2025-05-07T20:31:43.2315878Z if contiguous: 2025-05-07T20:31:43.2316113Z x0 = x0.contiguous() 2025-05-07T20:31:43.2316369Z x1 = x1.contiguous() 2025-05-07T20:31:43.2316603Z 2025-05-07T20:31:43.2316797Z if scale_ub is not None: 2025-05-07T20:31:43.2317074Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.2317409Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.2317722Z ) 2025-05-07T20:31:43.2317923Z else: 2025-05-07T20:31:43.2318145Z scale_ub_tensor = None 2025-05-07T20:31:43.2318397Z 2025-05-07T20:31:43.2318632Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.2318951Z op = silu_mul_quant 2025-05-07T20:31:43.2319202Z if compiled: 2025-05-07T20:31:43.2319456Z op = torch.compile(op) 2025-05-07T20:31:43.2319757Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.2320032Z 2025-05-07T20:31:43.2320235Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.2320530Z 2025-05-07T20:31:43.2320637Z moe/activation_test.py:117: 2025-05-07T20:31:43.2320932Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.2321269Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.2321555Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.2322121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.2322678Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.2323346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.2324048Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.2324588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.2325419Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.2326096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.2326640Z kernel = self.compile( 2025-05-07T20:31:43.2334088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.2334799Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.2335209Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.2335557Z 2025-05-07T20:31:43.2335771Z self = 2025-05-07T20:31:43.2336856Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.2338232Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c81d2a20>} 2025-05-07T20:31:43.2339585Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.2340606Z context = 2025-05-07T20:31:43.2340901Z 2025-05-07T20:31:43.2341078Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.2341601Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.2342070Z module_map=module_map) 2025-05-07T20:31:43.2342438Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.2342797Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.2343062Z E ^ 2025-05-07T20:31:43.2343526Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.2343985Z 2025-05-07T20:31:43.2344404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.2344920Z 2025-05-07T20:31:43.5616936Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.5617527Z self=, 2025-05-07T20:31:43.5618047Z T=16384, 2025-05-07T20:31:43.5618286Z D=5120, 2025-05-07T20:31:43.5618500Z scale_ub=None, 2025-05-07T20:31:43.5618730Z contiguous=False, 2025-05-07T20:31:43.5618972Z compiled=True, 2025-05-07T20:31:43.5619198Z ) 2025-05-07T20:31:43.5619525Z self = 2025-05-07T20:31:43.5620039Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:43.5620322Z 2025-05-07T20:31:43.5620726Z @given( 2025-05-07T20:31:43.5620969Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.5621286Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.5621599Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.5621937Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.5622268Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.5622572Z ) 2025-05-07T20:31:43.5622936Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.5623379Z def test_silu_mul_quant( 2025-05-07T20:31:43.5623632Z self, 2025-05-07T20:31:43.5623839Z T: int, 2025-05-07T20:31:43.5624038Z D: int, 2025-05-07T20:31:43.5624266Z scale_ub: Optional[float], 2025-05-07T20:31:43.5624549Z contiguous: bool, 2025-05-07T20:31:43.5624794Z compiled: bool, 2025-05-07T20:31:43.5625035Z ) -> None: 2025-05-07T20:31:43.5625259Z torch.manual_seed(2025) 2025-05-07T20:31:43.5625509Z 2025-05-07T20:31:43.5625792Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.5626139Z 2025-05-07T20:31:43.5626343Z x_sign = torch.sign(x) 2025-05-07T20:31:43.5626636Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.5626955Z x = x_sign * x_clamp 2025-05-07T20:31:43.5627205Z x0 = x[:, :D] 2025-05-07T20:31:43.5627425Z x1 = x[:, D:] 2025-05-07T20:31:43.5627651Z 2025-05-07T20:31:43.5627987Z if contiguous: 2025-05-07T20:31:43.5628226Z x0 = x0.contiguous() 2025-05-07T20:31:43.5628499Z x1 = x1.contiguous() 2025-05-07T20:31:43.5628753Z 2025-05-07T20:31:43.5628952Z if scale_ub is not None: 2025-05-07T20:31:43.5629248Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.5629600Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.5629916Z ) 2025-05-07T20:31:43.5630125Z else: 2025-05-07T20:31:43.5630400Z scale_ub_tensor = None 2025-05-07T20:31:43.5630661Z 2025-05-07T20:31:43.5630908Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.5631233Z op = silu_mul_quant 2025-05-07T20:31:43.5631494Z if compiled: 2025-05-07T20:31:43.5631747Z op = torch.compile(op) 2025-05-07T20:31:43.5632051Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.5632336Z 2025-05-07T20:31:43.5632535Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.5632715Z 2025-05-07T20:31:43.5632818Z moe/activation_test.py:117: 2025-05-07T20:31:43.5633127Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.5633469Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.5633777Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.5634353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.5634934Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.5635600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.5636300Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.5636845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.5637537Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.5638203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.5638744Z kernel = self.compile( 2025-05-07T20:31:43.5639295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.5639956Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.5640364Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.5640729Z 2025-05-07T20:31:43.5640941Z self = 2025-05-07T20:31:43.5642032Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.5643443Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c81d3c40>} 2025-05-07T20:31:43.5644794Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.5645828Z context = 2025-05-07T20:31:43.5646132Z 2025-05-07T20:31:43.5646301Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.5646835Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.5647307Z module_map=module_map) 2025-05-07T20:31:43.5647781Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.5648150Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.5648417Z E ^ 2025-05-07T20:31:43.5648977Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.5649432Z 2025-05-07T20:31:43.5649856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.5650370Z 2025-05-07T20:31:43.5650475Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.5650915Z self=, 2025-05-07T20:31:43.5651361Z T=2048, 2025-05-07T20:31:43.5651555Z D=5120, 2025-05-07T20:31:43.5651760Z scale_ub=None, 2025-05-07T20:31:43.5651983Z contiguous=False, 2025-05-07T20:31:43.5652206Z compiled=True, 2025-05-07T20:31:43.5652415Z ) 2025-05-07T20:31:43.6364719Z self = 2025-05-07T20:31:43.6365263Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:43.6365583Z 2025-05-07T20:31:43.6365718Z @given( 2025-05-07T20:31:43.6366037Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.6366367Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.6366687Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.6367021Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.6367359Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.6367737Z ) 2025-05-07T20:31:43.6368091Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.6368551Z def test_silu_mul_quant( 2025-05-07T20:31:43.6368803Z self, 2025-05-07T20:31:43.6369003Z T: int, 2025-05-07T20:31:43.6369213Z D: int, 2025-05-07T20:31:43.6369440Z scale_ub: Optional[float], 2025-05-07T20:31:43.6369714Z contiguous: bool, 2025-05-07T20:31:43.6369965Z compiled: bool, 2025-05-07T20:31:43.6370200Z ) -> None: 2025-05-07T20:31:43.6370430Z torch.manual_seed(2025) 2025-05-07T20:31:43.6370680Z 2025-05-07T20:31:43.6370965Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.6371325Z 2025-05-07T20:31:43.6371524Z x_sign = torch.sign(x) 2025-05-07T20:31:43.6371828Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.6372147Z x = x_sign * x_clamp 2025-05-07T20:31:43.6372395Z x0 = x[:, :D] 2025-05-07T20:31:43.6372622Z x1 = x[:, D:] 2025-05-07T20:31:43.6373020Z 2025-05-07T20:31:43.6373212Z if contiguous: 2025-05-07T20:31:43.6373463Z x0 = x0.contiguous() 2025-05-07T20:31:43.6373735Z x1 = x1.contiguous() 2025-05-07T20:31:43.6373981Z 2025-05-07T20:31:43.6374185Z if scale_ub is not None: 2025-05-07T20:31:43.6374468Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.6374807Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.6375128Z ) 2025-05-07T20:31:43.6375339Z else: 2025-05-07T20:31:43.6375554Z scale_ub_tensor = None 2025-05-07T20:31:43.6375813Z 2025-05-07T20:31:43.6376075Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.6376401Z op = silu_mul_quant 2025-05-07T20:31:43.6376661Z if compiled: 2025-05-07T20:31:43.6376910Z op = torch.compile(op) 2025-05-07T20:31:43.6377219Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.6377512Z 2025-05-07T20:31:43.6377710Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.6377886Z 2025-05-07T20:31:43.6377989Z moe/activation_test.py:117: 2025-05-07T20:31:43.6378297Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.6378637Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.6378927Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.6379499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.6380187Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.6380910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.6381623Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.6382166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.6382849Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.6383524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.6384069Z kernel = self.compile( 2025-05-07T20:31:43.6384622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.6385285Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.6385699Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.6385932Z 2025-05-07T20:31:43.6386148Z self = 2025-05-07T20:31:43.6387247Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.6388632Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bbbe47c0>} 2025-05-07T20:31:43.6389975Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.6391004Z context = 2025-05-07T20:31:43.6391297Z 2025-05-07T20:31:43.6391469Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.6391991Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.6392459Z module_map=module_map) 2025-05-07T20:31:43.6392827Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.6393189Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.6393591Z E ^ 2025-05-07T20:31:43.6394064Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.6394517Z 2025-05-07T20:31:43.6394940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.6395448Z 2025-05-07T20:31:43.6395553Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.6395977Z self=, 2025-05-07T20:31:43.6396387Z T=2048, 2025-05-07T20:31:43.6396579Z D=5120, 2025-05-07T20:31:43.6396780Z scale_ub=1200.0, 2025-05-07T20:31:43.6397010Z contiguous=False, 2025-05-07T20:31:43.6397244Z compiled=True, 2025-05-07T20:31:43.6397448Z ) 2025-05-07T20:31:43.6397771Z self = 2025-05-07T20:31:43.6398270Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:43.6398547Z 2025-05-07T20:31:43.6398626Z @given( 2025-05-07T20:31:43.6398861Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.6399179Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.6399481Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.6399812Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.6400149Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.6400434Z ) 2025-05-07T20:31:43.6400861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.6401304Z def test_silu_mul_quant( 2025-05-07T20:31:43.6401547Z self, 2025-05-07T20:31:43.6401740Z T: int, 2025-05-07T20:31:43.6401942Z D: int, 2025-05-07T20:31:43.6402161Z scale_ub: Optional[float], 2025-05-07T20:31:43.6402428Z contiguous: bool, 2025-05-07T20:31:43.6402670Z compiled: bool, 2025-05-07T20:31:43.6402895Z ) -> None: 2025-05-07T20:31:43.6403115Z torch.manual_seed(2025) 2025-05-07T20:31:43.6403358Z 2025-05-07T20:31:43.6403631Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.6403968Z 2025-05-07T20:31:43.6404165Z x_sign = torch.sign(x) 2025-05-07T20:31:43.6404459Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.6404772Z x = x_sign * x_clamp 2025-05-07T20:31:43.6405013Z x0 = x[:, :D] 2025-05-07T20:31:43.6405237Z x1 = x[:, D:] 2025-05-07T20:31:43.6405451Z 2025-05-07T20:31:43.6405804Z if contiguous: 2025-05-07T20:31:43.6406039Z x0 = x0.contiguous() 2025-05-07T20:31:43.6406301Z x1 = x1.contiguous() 2025-05-07T20:31:43.6406541Z 2025-05-07T20:31:43.6406740Z if scale_ub is not None: 2025-05-07T20:31:43.6407017Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.6407350Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.6407728Z ) 2025-05-07T20:31:43.6407934Z else: 2025-05-07T20:31:43.6408152Z scale_ub_tensor = None 2025-05-07T20:31:43.6408414Z 2025-05-07T20:31:43.6408659Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.6408980Z op = silu_mul_quant 2025-05-07T20:31:43.6409243Z if compiled: 2025-05-07T20:31:43.6409504Z op = torch.compile(op) 2025-05-07T20:31:43.6409805Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.6410097Z 2025-05-07T20:31:43.6410302Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.6410468Z 2025-05-07T20:31:43.6410579Z moe/activation_test.py:117: 2025-05-07T20:31:43.6410922Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.6411280Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.6411581Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.6412139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.6412840Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.6413510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.6414210Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.6414746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.6415444Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.6416113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.6416647Z kernel = self.compile( 2025-05-07T20:31:43.6417198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.6417856Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.6418265Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.6418498Z 2025-05-07T20:31:43.6418707Z self = 2025-05-07T20:31:43.6419792Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.6421272Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bbbe58a0>} 2025-05-07T20:31:43.6422624Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.6423660Z context = 2025-05-07T20:31:43.6423955Z 2025-05-07T20:31:43.6424124Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.6424654Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.6425123Z module_map=module_map) 2025-05-07T20:31:43.6425482Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.6425839Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.6426109Z E ^ 2025-05-07T20:31:43.6426580Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.6427032Z 2025-05-07T20:31:43.6427446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.6427966Z 2025-05-07T20:31:43.7739931Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.7740410Z self=, 2025-05-07T20:31:43.7740866Z T=4096, 2025-05-07T20:31:43.7741062Z D=5120, 2025-05-07T20:31:43.7741274Z scale_ub=1200.0, 2025-05-07T20:31:43.7741498Z contiguous=True, 2025-05-07T20:31:43.7741730Z compiled=True, 2025-05-07T20:31:43.7741946Z ) 2025-05-07T20:31:43.7742265Z self = 2025-05-07T20:31:43.7742772Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:43.7743045Z 2025-05-07T20:31:43.7743136Z @given( 2025-05-07T20:31:43.7743371Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.7743692Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.7744003Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.7744338Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.7744664Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.7745159Z ) 2025-05-07T20:31:43.7745509Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.7745946Z def test_silu_mul_quant( 2025-05-07T20:31:43.7746198Z self, 2025-05-07T20:31:43.7746402Z T: int, 2025-05-07T20:31:43.7746603Z D: int, 2025-05-07T20:31:43.7746834Z scale_ub: Optional[float], 2025-05-07T20:31:43.7747110Z contiguous: bool, 2025-05-07T20:31:43.7747350Z compiled: bool, 2025-05-07T20:31:43.7747588Z ) -> None: 2025-05-07T20:31:43.7747812Z torch.manual_seed(2025) 2025-05-07T20:31:43.7748057Z 2025-05-07T20:31:43.7748339Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.7748690Z 2025-05-07T20:31:43.7748894Z x_sign = torch.sign(x) 2025-05-07T20:31:43.7749194Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.7749509Z x = x_sign * x_clamp 2025-05-07T20:31:43.7749758Z x0 = x[:, :D] 2025-05-07T20:31:43.7749981Z x1 = x[:, D:] 2025-05-07T20:31:43.7750198Z 2025-05-07T20:31:43.7750392Z if contiguous: 2025-05-07T20:31:43.7750624Z x0 = x0.contiguous() 2025-05-07T20:31:43.7750898Z x1 = x1.contiguous() 2025-05-07T20:31:43.7751145Z 2025-05-07T20:31:43.7751342Z if scale_ub is not None: 2025-05-07T20:31:43.7751620Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.7751960Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.7752397Z ) 2025-05-07T20:31:43.7752610Z else: 2025-05-07T20:31:43.7752831Z scale_ub_tensor = None 2025-05-07T20:31:43.7753084Z 2025-05-07T20:31:43.7753321Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.7753640Z op = silu_mul_quant 2025-05-07T20:31:43.7753889Z if compiled: 2025-05-07T20:31:43.7754139Z op = torch.compile(op) 2025-05-07T20:31:43.7754439Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.7754735Z 2025-05-07T20:31:43.7754928Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.7755102Z 2025-05-07T20:31:43.7755204Z moe/activation_test.py:117: 2025-05-07T20:31:43.7755505Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.7755843Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.7756124Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.7756693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.7757256Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.7757910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.7758600Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.7759138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.7759826Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.7760482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.7761059Z kernel = self.compile( 2025-05-07T20:31:43.7761606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.7762260Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.7762659Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.7762892Z 2025-05-07T20:31:43.7763099Z self = 2025-05-07T20:31:43.7764177Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.7765623Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bbbe6ac0>} 2025-05-07T20:31:43.7766960Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.7768062Z context = 2025-05-07T20:31:43.7768349Z 2025-05-07T20:31:43.7768522Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.7769048Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.7769514Z module_map=module_map) 2025-05-07T20:31:43.7769888Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.7770252Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.7770516Z E ^ 2025-05-07T20:31:43.7771016Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.7771490Z 2025-05-07T20:31:43.7771908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.7772418Z 2025-05-07T20:31:43.7772538Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.7773027Z self=, 2025-05-07T20:31:43.7773439Z T=128, 2025-05-07T20:31:43.7773640Z D=5120, 2025-05-07T20:31:43.7773843Z scale_ub=1200.0, 2025-05-07T20:31:43.7774074Z contiguous=False, 2025-05-07T20:31:43.7774306Z compiled=True, 2025-05-07T20:31:43.7774515Z ) 2025-05-07T20:31:43.8603679Z self = 2025-05-07T20:31:43.8604288Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:43.8604729Z 2025-05-07T20:31:43.8604816Z @given( 2025-05-07T20:31:43.8605068Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.8605386Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.8605983Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.8606325Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.8606667Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.8606970Z ) 2025-05-07T20:31:43.8607329Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.8616853Z def test_silu_mul_quant( 2025-05-07T20:31:43.8617124Z self, 2025-05-07T20:31:43.8617324Z T: int, 2025-05-07T20:31:43.8617529Z D: int, 2025-05-07T20:31:43.8617764Z scale_ub: Optional[float], 2025-05-07T20:31:43.8618035Z contiguous: bool, 2025-05-07T20:31:43.8618296Z compiled: bool, 2025-05-07T20:31:43.8618522Z ) -> None: 2025-05-07T20:31:43.8618750Z torch.manual_seed(2025) 2025-05-07T20:31:43.8619007Z 2025-05-07T20:31:43.8619285Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.8619641Z 2025-05-07T20:31:43.8619848Z x_sign = torch.sign(x) 2025-05-07T20:31:43.8620154Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.8620465Z x = x_sign * x_clamp 2025-05-07T20:31:43.8620719Z x0 = x[:, :D] 2025-05-07T20:31:43.8620936Z x1 = x[:, D:] 2025-05-07T20:31:43.8621145Z 2025-05-07T20:31:43.8621342Z if contiguous: 2025-05-07T20:31:43.8621583Z x0 = x0.contiguous() 2025-05-07T20:31:43.8621841Z x1 = x1.contiguous() 2025-05-07T20:31:43.8622091Z 2025-05-07T20:31:43.8622292Z if scale_ub is not None: 2025-05-07T20:31:43.8622567Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.8623094Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.8623413Z ) 2025-05-07T20:31:43.8623606Z else: 2025-05-07T20:31:43.8623822Z scale_ub_tensor = None 2025-05-07T20:31:43.8624075Z 2025-05-07T20:31:43.8624304Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.8624622Z op = silu_mul_quant 2025-05-07T20:31:43.8624879Z if compiled: 2025-05-07T20:31:43.8625130Z op = torch.compile(op) 2025-05-07T20:31:43.8625432Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.8625718Z 2025-05-07T20:31:43.8625917Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.8626085Z 2025-05-07T20:31:43.8626186Z moe/activation_test.py:117: 2025-05-07T20:31:43.8626481Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.8626827Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.8627103Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.8627672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.8628232Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.8628891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.8629577Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.8630229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.8630915Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.8631575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.8632107Z kernel = self.compile( 2025-05-07T20:31:43.8632651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.8633316Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.8633708Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.8633942Z 2025-05-07T20:31:43.8634151Z self = 2025-05-07T20:31:43.8635239Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.8636610Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb970540>} 2025-05-07T20:31:43.8637944Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.8638974Z context = 2025-05-07T20:31:43.8639267Z 2025-05-07T20:31:43.8639432Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.8639955Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.8640419Z module_map=module_map) 2025-05-07T20:31:43.8640790Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.8641150Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.8641415Z E ^ 2025-05-07T20:31:43.8641878Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.8642330Z 2025-05-07T20:31:43.8642745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.8643252Z 2025-05-07T20:31:43.8643449Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.8643866Z self=, 2025-05-07T20:31:43.8644264Z T=16384, 2025-05-07T20:31:43.8644467Z D=7168, 2025-05-07T20:31:43.8644666Z scale_ub=1200.0, 2025-05-07T20:31:43.8644888Z contiguous=True, 2025-05-07T20:31:43.8645115Z compiled=True, 2025-05-07T20:31:43.8645324Z ) 2025-05-07T20:31:43.8645639Z self = 2025-05-07T20:31:43.8646144Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:43.8646420Z 2025-05-07T20:31:43.8646507Z @given( 2025-05-07T20:31:43.8646736Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.8647054Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.8647364Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.8647763Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.8648092Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.8648384Z ) 2025-05-07T20:31:43.8648740Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.8649180Z def test_silu_mul_quant( 2025-05-07T20:31:43.8649424Z self, 2025-05-07T20:31:43.8649627Z T: int, 2025-05-07T20:31:43.8649822Z D: int, 2025-05-07T20:31:43.8650043Z scale_ub: Optional[float], 2025-05-07T20:31:43.8650320Z contiguous: bool, 2025-05-07T20:31:43.8650647Z compiled: bool, 2025-05-07T20:31:43.8650876Z ) -> None: 2025-05-07T20:31:43.8651093Z torch.manual_seed(2025) 2025-05-07T20:31:43.8651332Z 2025-05-07T20:31:43.8651605Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.8651952Z 2025-05-07T20:31:43.8652145Z x_sign = torch.sign(x) 2025-05-07T20:31:43.8652435Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.8652750Z x = x_sign * x_clamp 2025-05-07T20:31:43.8652993Z x0 = x[:, :D] 2025-05-07T20:31:43.8653206Z x1 = x[:, D:] 2025-05-07T20:31:43.8653418Z 2025-05-07T20:31:43.8653618Z if contiguous: 2025-05-07T20:31:43.8653845Z x0 = x0.contiguous() 2025-05-07T20:31:43.8654106Z x1 = x1.contiguous() 2025-05-07T20:31:43.8654354Z 2025-05-07T20:31:43.8654549Z if scale_ub is not None: 2025-05-07T20:31:43.8654832Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.8655180Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.8655490Z ) 2025-05-07T20:31:43.8655683Z else: 2025-05-07T20:31:43.8655896Z scale_ub_tensor = None 2025-05-07T20:31:43.8656150Z 2025-05-07T20:31:43.8656384Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.8656696Z op = silu_mul_quant 2025-05-07T20:31:43.8656946Z if compiled: 2025-05-07T20:31:43.8657200Z op = torch.compile(op) 2025-05-07T20:31:43.8657492Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.8657769Z 2025-05-07T20:31:43.8657962Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.8658124Z 2025-05-07T20:31:43.8658224Z moe/activation_test.py:117: 2025-05-07T20:31:43.8658523Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.8658856Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.8659134Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.8659694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.8660254Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.8660915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.8661596Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.8662131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.8662896Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.8663548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.8664076Z kernel = self.compile( 2025-05-07T20:31:43.8664616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.8665276Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.8665668Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.8665897Z 2025-05-07T20:31:43.8666102Z self = 2025-05-07T20:31:43.8667173Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.8668540Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb971080>} 2025-05-07T20:31:43.8669960Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.8670981Z context = 2025-05-07T20:31:43.8671271Z 2025-05-07T20:31:43.8671437Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.8671957Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.8672421Z module_map=module_map) 2025-05-07T20:31:43.8672784Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.8673138Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.8673398Z E ^ 2025-05-07T20:31:43.8673855Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.8674307Z 2025-05-07T20:31:43.8674720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.8675232Z 2025-05-07T20:31:43.9620565Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.9621117Z self=, 2025-05-07T20:31:43.9621697Z T=16384, 2025-05-07T20:31:43.9621965Z D=5120, 2025-05-07T20:31:43.9622239Z scale_ub=1200.0, 2025-05-07T20:31:43.9622502Z contiguous=True, 2025-05-07T20:31:43.9622724Z compiled=False, 2025-05-07T20:31:43.9622927Z ) 2025-05-07T20:31:43.9623249Z self = 2025-05-07T20:31:43.9623756Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:43.9624038Z 2025-05-07T20:31:43.9624114Z @given( 2025-05-07T20:31:43.9624348Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.9624656Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.9624963Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.9625295Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.9625628Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.9625909Z ) 2025-05-07T20:31:43.9626264Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.9626704Z def test_silu_mul_quant( 2025-05-07T20:31:43.9626944Z self, 2025-05-07T20:31:43.9627140Z T: int, 2025-05-07T20:31:43.9627341Z D: int, 2025-05-07T20:31:43.9627551Z scale_ub: Optional[float], 2025-05-07T20:31:43.9628002Z contiguous: bool, 2025-05-07T20:31:43.9628239Z compiled: bool, 2025-05-07T20:31:43.9628462Z ) -> None: 2025-05-07T20:31:43.9628683Z torch.manual_seed(2025) 2025-05-07T20:31:43.9628930Z 2025-05-07T20:31:43.9629199Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.9629545Z 2025-05-07T20:31:43.9629741Z x_sign = torch.sign(x) 2025-05-07T20:31:43.9630036Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.9630348Z x = x_sign * x_clamp 2025-05-07T20:31:43.9630634Z x0 = x[:, :D] 2025-05-07T20:31:43.9630857Z x1 = x[:, D:] 2025-05-07T20:31:43.9631064Z 2025-05-07T20:31:43.9631248Z if contiguous: 2025-05-07T20:31:43.9631487Z x0 = x0.contiguous() 2025-05-07T20:31:43.9631737Z x1 = x1.contiguous() 2025-05-07T20:31:43.9631981Z 2025-05-07T20:31:43.9632174Z if scale_ub is not None: 2025-05-07T20:31:43.9632450Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.9632784Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.9633099Z ) 2025-05-07T20:31:43.9633290Z else: 2025-05-07T20:31:43.9633505Z scale_ub_tensor = None 2025-05-07T20:31:43.9633756Z 2025-05-07T20:31:43.9633988Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.9634301Z op = silu_mul_quant 2025-05-07T20:31:43.9634546Z if compiled: 2025-05-07T20:31:43.9634936Z op = torch.compile(op) 2025-05-07T20:31:43.9635232Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.9635504Z 2025-05-07T20:31:43.9635698Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.9635862Z 2025-05-07T20:31:43.9635962Z moe/activation_test.py:117: 2025-05-07T20:31:43.9636264Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.9636599Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.9636886Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.9637574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:43.9638276Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.9638813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.9639506Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.9640185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.9640725Z kernel = self.compile( 2025-05-07T20:31:43.9641268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.9641922Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.9642317Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.9642555Z 2025-05-07T20:31:43.9642760Z self = 2025-05-07T20:31:43.9643843Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.9645216Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb972660>} 2025-05-07T20:31:43.9646554Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.9647682Z context = 2025-05-07T20:31:43.9648075Z 2025-05-07T20:31:43.9648239Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.9648768Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.9649224Z module_map=module_map) 2025-05-07T20:31:43.9649595Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.9649946Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.9650208Z E ^ 2025-05-07T20:31:43.9650676Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.9651132Z 2025-05-07T20:31:43.9651546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.9652054Z 2025-05-07T20:31:43.9652167Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.9652570Z self=, 2025-05-07T20:31:43.9652980Z T=1, 2025-05-07T20:31:43.9653167Z D=7168, 2025-05-07T20:31:43.9653357Z scale_ub=1200.0, 2025-05-07T20:31:43.9653587Z contiguous=False, 2025-05-07T20:31:43.9653809Z compiled=False, 2025-05-07T20:31:43.9654008Z ) 2025-05-07T20:31:43.9654328Z self = 2025-05-07T20:31:43.9654816Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:43.9655085Z 2025-05-07T20:31:43.9655249Z @given( 2025-05-07T20:31:43.9655483Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.9655791Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.9656104Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.9656426Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.9656758Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.9657041Z ) 2025-05-07T20:31:43.9657393Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.9657832Z def test_silu_mul_quant( 2025-05-07T20:31:43.9658082Z self, 2025-05-07T20:31:43.9658272Z T: int, 2025-05-07T20:31:43.9658467Z D: int, 2025-05-07T20:31:43.9658684Z scale_ub: Optional[float], 2025-05-07T20:31:43.9658956Z contiguous: bool, 2025-05-07T20:31:43.9659196Z compiled: bool, 2025-05-07T20:31:43.9659415Z ) -> None: 2025-05-07T20:31:43.9659640Z torch.manual_seed(2025) 2025-05-07T20:31:43.9659874Z 2025-05-07T20:31:43.9660149Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.9660492Z 2025-05-07T20:31:43.9660685Z x_sign = torch.sign(x) 2025-05-07T20:31:43.9660978Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.9661285Z x = x_sign * x_clamp 2025-05-07T20:31:43.9661525Z x0 = x[:, :D] 2025-05-07T20:31:43.9661745Z x1 = x[:, D:] 2025-05-07T20:31:43.9661963Z 2025-05-07T20:31:43.9662142Z if contiguous: 2025-05-07T20:31:43.9662378Z x0 = x0.contiguous() 2025-05-07T20:31:43.9662639Z x1 = x1.contiguous() 2025-05-07T20:31:43.9662867Z 2025-05-07T20:31:43.9663059Z if scale_ub is not None: 2025-05-07T20:31:43.9663342Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.9663669Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.9663987Z ) 2025-05-07T20:31:43.9664189Z else: 2025-05-07T20:31:43.9664400Z scale_ub_tensor = None 2025-05-07T20:31:43.9664646Z 2025-05-07T20:31:43.9664876Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.9665182Z op = silu_mul_quant 2025-05-07T20:31:43.9665427Z if compiled: 2025-05-07T20:31:43.9665675Z op = torch.compile(op) 2025-05-07T20:31:43.9665971Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.9666329Z 2025-05-07T20:31:43.9666525Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.9666685Z 2025-05-07T20:31:43.9666788Z moe/activation_test.py:117: 2025-05-07T20:31:43.9667082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.9667407Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.9667694Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.9668378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.9669064Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.9669600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.9670278Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.9670980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.9671515Z kernel = self.compile( 2025-05-07T20:31:43.9672056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.9672700Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.9673100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.9673332Z 2025-05-07T20:31:43.9673535Z self = 2025-05-07T20:31:43.9674692Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.9676051Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb971d00>} 2025-05-07T20:31:43.9677387Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.9678404Z context = 2025-05-07T20:31:43.9678692Z 2025-05-07T20:31:43.9678853Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.9679374Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.9679826Z module_map=module_map) 2025-05-07T20:31:43.9680185Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.9680535Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.9680790Z E ^ 2025-05-07T20:31:43.9681300Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.9681757Z 2025-05-07T20:31:43.9682170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.9682675Z 2025-05-07T20:31:44.3148729Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.3149404Z self=, 2025-05-07T20:31:44.3149967Z T=4096, 2025-05-07T20:31:44.3150236Z D=7168, 2025-05-07T20:31:44.3150706Z scale_ub=1200.0, 2025-05-07T20:31:44.3151180Z contiguous=False, 2025-05-07T20:31:44.3151628Z compiled=True, 2025-05-07T20:31:44.3152038Z ) 2025-05-07T20:31:44.3152664Z self = 2025-05-07T20:31:44.3153653Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:44.3154197Z 2025-05-07T20:31:44.3154367Z @given( 2025-05-07T20:31:44.3154823Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.3155778Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.3156385Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.3157042Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.3157689Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.3158256Z ) 2025-05-07T20:31:44.3158951Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.3159817Z def test_silu_mul_quant( 2025-05-07T20:31:44.3160301Z self, 2025-05-07T20:31:44.3160631Z T: int, 2025-05-07T20:31:44.3160852Z D: int, 2025-05-07T20:31:44.3161096Z scale_ub: Optional[float], 2025-05-07T20:31:44.3161368Z contiguous: bool, 2025-05-07T20:31:44.3161605Z compiled: bool, 2025-05-07T20:31:44.3161833Z ) -> None: 2025-05-07T20:31:44.3162067Z torch.manual_seed(2025) 2025-05-07T20:31:44.3162311Z 2025-05-07T20:31:44.3162583Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.3162938Z 2025-05-07T20:31:44.3163133Z x_sign = torch.sign(x) 2025-05-07T20:31:44.3163422Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.3163734Z x = x_sign * x_clamp 2025-05-07T20:31:44.3163976Z x0 = x[:, :D] 2025-05-07T20:31:44.3164195Z x1 = x[:, D:] 2025-05-07T20:31:44.3164401Z 2025-05-07T20:31:44.3164590Z if contiguous: 2025-05-07T20:31:44.3164823Z x0 = x0.contiguous() 2025-05-07T20:31:44.3165192Z x1 = x1.contiguous() 2025-05-07T20:31:44.3165438Z 2025-05-07T20:31:44.3165633Z if scale_ub is not None: 2025-05-07T20:31:44.3165901Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.3166241Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.3166554Z ) 2025-05-07T20:31:44.3166746Z else: 2025-05-07T20:31:44.3166961Z scale_ub_tensor = None 2025-05-07T20:31:44.3167215Z 2025-05-07T20:31:44.3167445Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.3167873Z op = silu_mul_quant 2025-05-07T20:31:44.3168127Z if compiled: 2025-05-07T20:31:44.3168381Z op = torch.compile(op) 2025-05-07T20:31:44.3168673Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.3168952Z 2025-05-07T20:31:44.3169152Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.3169316Z 2025-05-07T20:31:44.3169417Z moe/activation_test.py:117: 2025-05-07T20:31:44.3169719Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.3170051Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.3177162Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.3177876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:44.3178449Z return fn(*args, **kwargs) 
2025-05-07T20:31:44.3179259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.3179993Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.3180530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.3181219Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.3181888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.3182424Z kernel = self.compile( 2025-05-07T20:31:44.3182966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.3183628Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.3184034Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.3184265Z 2025-05-07T20:31:44.3184603Z self = 2025-05-07T20:31:44.3185681Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.3187058Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c852ccc0>} 2025-05-07T20:31:44.3188401Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.3189423Z context = 2025-05-07T20:31:44.3189711Z 2025-05-07T20:31:44.3189889Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.3190416Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.3190893Z module_map=module_map) 2025-05-07T20:31:44.3191267Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.3191624Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.3191889Z E ^ 2025-05-07T20:31:44.3192437Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.3192888Z 2025-05-07T20:31:44.3193309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.3193821Z 2025-05-07T20:31:44.3193930Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.3194347Z self=, 2025-05-07T20:31:44.3194755Z T=128, 2025-05-07T20:31:44.3194945Z D=7168, 2025-05-07T20:31:44.3195156Z scale_ub=1200.0, 2025-05-07T20:31:44.3195387Z contiguous=False, 2025-05-07T20:31:44.3195618Z compiled=True, 2025-05-07T20:31:44.3195823Z ) 2025-05-07T20:31:44.3894282Z self = 2025-05-07T20:31:44.3895021Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:44.3895431Z 2025-05-07T20:31:44.3895550Z @given( 2025-05-07T20:31:44.3895855Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.3896264Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.3896572Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.3896902Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.3897232Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.3897522Z ) 2025-05-07T20:31:44.3897870Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.3898310Z def test_silu_mul_quant( 2025-05-07T20:31:44.3898560Z self, 2025-05-07T20:31:44.3898759Z T: int, 2025-05-07T20:31:44.3898951Z D: int, 2025-05-07T20:31:44.3899171Z scale_ub: Optional[float], 2025-05-07T20:31:44.3899439Z contiguous: bool, 2025-05-07T20:31:44.3899677Z compiled: bool, 2025-05-07T20:31:44.3899905Z ) -> None: 2025-05-07T20:31:44.3900119Z torch.manual_seed(2025) 2025-05-07T20:31:44.3900359Z 2025-05-07T20:31:44.3900636Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.3900983Z 2025-05-07T20:31:44.3901178Z x_sign = torch.sign(x) 2025-05-07T20:31:44.3901471Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.3901781Z x = x_sign * x_clamp 2025-05-07T20:31:44.3902019Z x0 = x[:, :D] 2025-05-07T20:31:44.3902244Z x1 = x[:, D:] 2025-05-07T20:31:44.3902455Z 2025-05-07T20:31:44.3902649Z if contiguous: 2025-05-07T20:31:44.3902884Z x0 = x0.contiguous() 2025-05-07T20:31:44.3903326Z x1 = x1.contiguous() 2025-05-07T20:31:44.3903570Z 2025-05-07T20:31:44.3903760Z if scale_ub is not None: 2025-05-07T20:31:44.3904039Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.3904375Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.3904686Z ) 2025-05-07T20:31:44.3904883Z else: 2025-05-07T20:31:44.3905101Z scale_ub_tensor = None 2025-05-07T20:31:44.3905352Z 2025-05-07T20:31:44.3905762Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.3906088Z op = silu_mul_quant 2025-05-07T20:31:44.3906335Z if compiled: 2025-05-07T20:31:44.3906587Z op = torch.compile(op) 2025-05-07T20:31:44.3906885Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.3907157Z 2025-05-07T20:31:44.3907351Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.3907518Z 2025-05-07T20:31:44.3907627Z moe/activation_test.py:117: 2025-05-07T20:31:44.3907925Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.3908256Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.3908541Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.3909098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:44.3909651Z return fn(*args, **kwargs) 
2025-05-07T20:31:44.3910435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.3911129Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.3911663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.3912337Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.3912994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.3913531Z kernel = self.compile( 2025-05-07T20:31:44.3914064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.3914719Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.3915113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.3915341Z 2025-05-07T20:31:44.3915557Z self = 2025-05-07T20:31:44.3916623Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.3917981Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c852d580>} 2025-05-07T20:31:44.3919324Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.3920339Z context = 2025-05-07T20:31:44.3920626Z 2025-05-07T20:31:44.3920791Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.3921316Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.3921782Z module_map=module_map) 2025-05-07T20:31:44.3922146Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.3922500Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.3922765Z E ^ 2025-05-07T20:31:44.3923233Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.3923808Z 2025-05-07T20:31:44.3924220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.3924729Z 2025-05-07T20:31:44.3924836Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.3925243Z self=, 2025-05-07T20:31:44.3925644Z T=2048, 2025-05-07T20:31:44.3925838Z D=7168, 2025-05-07T20:31:44.3926032Z scale_ub=None, 2025-05-07T20:31:44.3926249Z contiguous=True, 2025-05-07T20:31:44.3926473Z compiled=True, 2025-05-07T20:31:44.3926674Z ) 2025-05-07T20:31:44.3926992Z self = 2025-05-07T20:31:44.3927479Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:44.3927803Z 2025-05-07T20:31:44.3927886Z @given( 2025-05-07T20:31:44.3928109Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.3928428Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.3928732Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.3929056Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.3929385Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.3929673Z ) 2025-05-07T20:31:44.3930015Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.3930927Z def test_silu_mul_quant( 2025-05-07T20:31:44.3931178Z self, 2025-05-07T20:31:44.3931372Z T: int, 2025-05-07T20:31:44.3931570Z D: int, 2025-05-07T20:31:44.3931786Z scale_ub: Optional[float], 2025-05-07T20:31:44.3932057Z contiguous: bool, 2025-05-07T20:31:44.3932312Z compiled: bool, 2025-05-07T20:31:44.3932536Z ) -> None: 2025-05-07T20:31:44.3932754Z torch.manual_seed(2025) 2025-05-07T20:31:44.3932989Z 2025-05-07T20:31:44.3933265Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.3933608Z 2025-05-07T20:31:44.3933802Z x_sign = torch.sign(x) 2025-05-07T20:31:44.3934095Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.3934406Z x = x_sign * x_clamp 2025-05-07T20:31:44.3934656Z x0 = x[:, :D] 2025-05-07T20:31:44.3934873Z x1 = x[:, D:] 2025-05-07T20:31:44.3935083Z 2025-05-07T20:31:44.3935272Z if contiguous: 2025-05-07T20:31:44.3935508Z x0 = x0.contiguous() 2025-05-07T20:31:44.3935765Z x1 = x1.contiguous() 2025-05-07T20:31:44.3936002Z 2025-05-07T20:31:44.3936190Z if scale_ub is not None: 2025-05-07T20:31:44.3936460Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.3936791Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.3937097Z ) 2025-05-07T20:31:44.3937293Z else: 2025-05-07T20:31:44.3937506Z scale_ub_tensor = None 2025-05-07T20:31:44.3937759Z 2025-05-07T20:31:44.3937991Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.3938306Z op = silu_mul_quant 2025-05-07T20:31:44.3938552Z if compiled: 2025-05-07T20:31:44.3938799Z op = torch.compile(op) 2025-05-07T20:31:44.3939094Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.3939369Z 2025-05-07T20:31:44.3939559Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.3939728Z 2025-05-07T20:31:44.3939833Z moe/activation_test.py:117: 2025-05-07T20:31:44.3940129Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.3940457Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.3940737Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.3941292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:44.3941844Z return fn(*args, **kwargs) 
2025-05-07T20:31:44.3942587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.3943276Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.3943806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.3944477Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.3945138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.3945664Z kernel = self.compile( 2025-05-07T20:31:44.3946199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.3946842Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.3947236Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.3947466Z 2025-05-07T20:31:44.3947676Z self = 2025-05-07T20:31:44.3948742Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.3950210Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c852e480>} 2025-05-07T20:31:44.3951546Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.3952563Z context = 2025-05-07T20:31:44.3952847Z 2025-05-07T20:31:44.3953014Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.3953539Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.3954005Z module_map=module_map) 2025-05-07T20:31:44.3954367Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.3954716Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.3954976Z E ^ 2025-05-07T20:31:44.3955446Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.3955892Z 2025-05-07T20:31:44.3956310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.3956814Z 2025-05-07T20:31:44.4569704Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.4570213Z self=, 2025-05-07T20:31:44.4570760Z T=16384, 2025-05-07T20:31:44.4571073Z D=5120, 2025-05-07T20:31:44.4571400Z scale_ub=None, 2025-05-07T20:31:44.4571703Z contiguous=False, 2025-05-07T20:31:44.4572006Z compiled=False, 2025-05-07T20:31:44.4572296Z ) 2025-05-07T20:31:44.4572628Z self = 2025-05-07T20:31:44.4573124Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:44.4573415Z 2025-05-07T20:31:44.4573498Z @given( 2025-05-07T20:31:44.4573737Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.4574057Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.4574356Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.4574692Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.4575023Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.4575311Z ) 2025-05-07T20:31:44.4575663Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.4576284Z def test_silu_mul_quant( 2025-05-07T20:31:44.4576531Z self, 2025-05-07T20:31:44.4576730Z T: int, 2025-05-07T20:31:44.4576936Z D: int, 2025-05-07T20:31:44.4577147Z scale_ub: Optional[float], 2025-05-07T20:31:44.4577425Z contiguous: bool, 2025-05-07T20:31:44.4577666Z compiled: bool, 2025-05-07T20:31:44.4577895Z ) -> None: 2025-05-07T20:31:44.4578113Z torch.manual_seed(2025) 2025-05-07T20:31:44.4578367Z 2025-05-07T20:31:44.4578643Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.4578998Z 2025-05-07T20:31:44.4579198Z x_sign = torch.sign(x) 2025-05-07T20:31:44.4579500Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.4581531Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
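This first OutOfMemoryError follows directly from the test's allocation pattern: x is a [T, 2*D] bfloat16 tensor, so T=16384, D=5120 needs 16384 * 10240 * 2 bytes = 320 MiB, exactly the failed request, and the free-memory figures reported across examples keep shrinking, which suggests tensors from earlier failing examples are not being released. Two mitigations worth noting, as a sketch only: the allocator's own suggestion of expandable segments (which only takes effect if set before CUDA is initialized), and explicitly returning cached blocks between examples. Both calls below are standard PyTorch/CPython APIs; wiring them into the Hypothesis test is an assumption, not something the suite does today:

    import os
    # The allocator's own suggestion from the error message; it must be set
    # before the first CUDA allocation in the process to have any effect.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import gc
    import torch

    def release_cuda_memory() -> None:
        # Hypothetical per-example cleanup: drop dead references, then hand
        # cached blocks back so the next example starts from a clean pool.
        gc.collect()
        torch.cuda.empty_cache()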
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.4583420Z 2025-05-07T20:31:44.4583548Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:44.4583764Z 2025-05-07T20:31:44.4583993Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.4584417Z self=, 2025-05-07T20:31:44.4584824Z T=4096, 2025-05-07T20:31:44.4585013Z D=7168, 2025-05-07T20:31:44.4585212Z scale_ub=1200.0, 2025-05-07T20:31:44.4585435Z contiguous=True, 2025-05-07T20:31:44.4585654Z compiled=True, 2025-05-07T20:31:44.4585860Z ) 2025-05-07T20:31:44.4586180Z self = 2025-05-07T20:31:44.4586682Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:44.4586953Z 2025-05-07T20:31:44.4587034Z @given( 2025-05-07T20:31:44.4587265Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.4587583Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.4587886Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.4588215Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.4588555Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.4588837Z ) 2025-05-07T20:31:44.4589189Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.4589630Z def test_silu_mul_quant( 2025-05-07T20:31:44.4589873Z self, 2025-05-07T20:31:44.4590066Z T: int, 2025-05-07T20:31:44.4590267Z D: int, 2025-05-07T20:31:44.4590490Z scale_ub: Optional[float], 2025-05-07T20:31:44.4590760Z contiguous: bool, 2025-05-07T20:31:44.4591001Z compiled: bool, 2025-05-07T20:31:44.4591226Z ) -> None: 2025-05-07T20:31:44.4591442Z torch.manual_seed(2025) 2025-05-07T20:31:44.4591682Z 2025-05-07T20:31:44.4591955Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.4592290Z 2025-05-07T20:31:44.4592491Z x_sign = torch.sign(x) 2025-05-07T20:31:44.4592779Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.4594794Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.4596817Z 2025-05-07T20:31:44.4596942Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:44.4597154Z 2025-05-07T20:31:44.4597258Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.4597669Z self=, 2025-05-07T20:31:44.4598072Z T=16384, 2025-05-07T20:31:44.4598264Z D=7168, 2025-05-07T20:31:44.4598462Z scale_ub=None, 2025-05-07T20:31:44.4598686Z contiguous=False, 2025-05-07T20:31:44.4598907Z compiled=False, 2025-05-07T20:31:44.4599115Z ) 2025-05-07T20:31:44.4599435Z self = 2025-05-07T20:31:44.4599930Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:44.4600213Z 2025-05-07T20:31:44.4600292Z @given( 2025-05-07T20:31:44.4600520Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.4600846Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.4601150Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.4601482Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.4601813Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.4602096Z ) 2025-05-07T20:31:44.4602448Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.4602891Z def test_silu_mul_quant( 2025-05-07T20:31:44.4603214Z self, 2025-05-07T20:31:44.4603417Z T: int, 2025-05-07T20:31:44.4603622Z D: int, 2025-05-07T20:31:44.4603835Z scale_ub: Optional[float], 2025-05-07T20:31:44.4604106Z contiguous: bool, 2025-05-07T20:31:44.4604369Z compiled: bool, 2025-05-07T20:31:44.4604595Z ) -> None: 2025-05-07T20:31:44.4604806Z torch.manual_seed(2025) 2025-05-07T20:31:44.4605049Z 2025-05-07T20:31:44.4605323Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.4607663Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.4609544Z 2025-05-07T20:31:44.4609672Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:44.4609881Z 2025-05-07T20:31:44.4609985Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.4610405Z self=, 2025-05-07T20:31:44.4610806Z T=2048, 2025-05-07T20:31:44.4611001Z D=7168, 2025-05-07T20:31:44.4611195Z scale_ub=1200.0, 2025-05-07T20:31:44.4611425Z contiguous=True, 2025-05-07T20:31:44.4611642Z compiled=True, 2025-05-07T20:31:44.4611850Z ) 2025-05-07T20:31:44.4612167Z self = 2025-05-07T20:31:44.4612658Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:44.4612934Z 2025-05-07T20:31:44.4613009Z @given( 2025-05-07T20:31:44.4613243Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.4613562Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.4613866Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.4614194Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.4614520Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.4614807Z ) 2025-05-07T20:31:44.4615157Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.4615730Z def test_silu_mul_quant( 2025-05-07T20:31:44.4615968Z self, 2025-05-07T20:31:44.4616167Z T: int, 2025-05-07T20:31:44.4616369Z D: int, 2025-05-07T20:31:44.4616586Z scale_ub: Optional[float], 2025-05-07T20:31:44.4616859Z contiguous: bool, 2025-05-07T20:31:44.4617099Z compiled: bool, 2025-05-07T20:31:44.4617325Z ) -> None: 2025-05-07T20:31:44.4617537Z torch.manual_seed(2025) 2025-05-07T20:31:44.4617780Z 2025-05-07T20:31:44.4618055Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.4618402Z 2025-05-07T20:31:44.4618598Z x_sign = torch.sign(x) 2025-05-07T20:31:44.4618888Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.4620889Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.4622757Z 2025-05-07T20:31:44.4622879Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:44.4623094Z 2025-05-07T20:31:44.4623310Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.4623726Z self=, 2025-05-07T20:31:44.4624136Z T=2048, 2025-05-07T20:31:44.4624321Z D=7168, 2025-05-07T20:31:44.4624515Z scale_ub=None, 2025-05-07T20:31:44.4624729Z contiguous=True, 2025-05-07T20:31:44.4624944Z compiled=False, 2025-05-07T20:31:44.4625155Z ) 2025-05-07T20:31:44.5482884Z self = 2025-05-07T20:31:44.5484284Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:44.5484963Z 2025-05-07T20:31:44.5485159Z @given( 2025-05-07T20:31:44.5485719Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.5486469Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.5487131Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.5487862Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.5488474Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.5488998Z ) 2025-05-07T20:31:44.5489630Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.5490438Z def test_silu_mul_quant( 2025-05-07T20:31:44.5490869Z self, 2025-05-07T20:31:44.5491100Z T: int, 2025-05-07T20:31:44.5491305Z D: int, 2025-05-07T20:31:44.5491532Z scale_ub: Optional[float], 2025-05-07T20:31:44.5491801Z contiguous: bool, 2025-05-07T20:31:44.5492052Z compiled: bool, 2025-05-07T20:31:44.5492284Z ) -> None: 2025-05-07T20:31:44.5492499Z torch.manual_seed(2025) 2025-05-07T20:31:44.5492752Z 2025-05-07T20:31:44.5493028Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.5493370Z 2025-05-07T20:31:44.5493573Z > x_sign = torch.sign(x) 2025-05-07T20:31:44.5495522Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
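Since the parameter grid itself determines peak memory, another hedged option is to bound the search space with hypothesis.assume, so oversized T-by-D combinations are discarded instead of crashing the worker. The 1 GiB cap below is an arbitrary illustrative bound, not a value from the test suite:

    import torch
    from hypothesis import assume, given, settings
    from hypothesis import strategies as st

    MAX_INPUT_BYTES = 1 << 30  # illustrative 1 GiB cap on the bf16 input

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @settings(deadline=None)
    def test_silu_mul_quant_bounded(T: int, D: int) -> None:
        # x is [T, 2 * D] bfloat16, i.e. T * 2 * D * 2 bytes; for example
        # 2048 x 14336 bf16 is 56 MiB, matching the failed request just above.
        assume(T * 2 * D * 2 <= MAX_INPUT_BYTES)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
        del x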
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.5497594Z 2025-05-07T20:31:44.5497715Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:44.5497928Z 2025-05-07T20:31:44.5498041Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.5498454Z self=, 2025-05-07T20:31:44.5498874Z T=1, 2025-05-07T20:31:44.5499066Z D=7168, 2025-05-07T20:31:44.5499270Z scale_ub=1200.0, 2025-05-07T20:31:44.5499492Z contiguous=True, 2025-05-07T20:31:44.5499722Z compiled=False, 2025-05-07T20:31:44.5499942Z ) 2025-05-07T20:31:44.5500257Z self = 2025-05-07T20:31:44.5508554Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:44.5508836Z 2025-05-07T20:31:44.5508926Z @given( 2025-05-07T20:31:44.5509163Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.5509482Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.5509800Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.5510154Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.5510496Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.5510797Z ) 2025-05-07T20:31:44.5511159Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.5511612Z def test_silu_mul_quant( 2025-05-07T20:31:44.5511866Z self, 2025-05-07T20:31:44.5512079Z T: int, 2025-05-07T20:31:44.5512291Z D: int, 2025-05-07T20:31:44.5512689Z scale_ub: Optional[float], 2025-05-07T20:31:44.5512979Z contiguous: bool, 2025-05-07T20:31:44.5513238Z compiled: bool, 2025-05-07T20:31:44.5513470Z ) -> None: 2025-05-07T20:31:44.5513701Z torch.manual_seed(2025) 2025-05-07T20:31:44.5513956Z 2025-05-07T20:31:44.5514236Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.5514596Z 2025-05-07T20:31:44.5514805Z x_sign = torch.sign(x) 2025-05-07T20:31:44.5515107Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.5515435Z x = x_sign * x_clamp 2025-05-07T20:31:44.5515690Z x0 = x[:, :D] 2025-05-07T20:31:44.5515925Z x1 = x[:, D:] 2025-05-07T20:31:44.5516145Z 2025-05-07T20:31:44.5516347Z if contiguous: 2025-05-07T20:31:44.5516597Z x0 = x0.contiguous() 2025-05-07T20:31:44.5516865Z x1 = x1.contiguous() 2025-05-07T20:31:44.5517115Z 2025-05-07T20:31:44.5517330Z if scale_ub is not None: 2025-05-07T20:31:44.5517611Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.5517965Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.5518283Z ) 2025-05-07T20:31:44.5518487Z else: 2025-05-07T20:31:44.5518713Z scale_ub_tensor = None 2025-05-07T20:31:44.5518980Z 2025-05-07T20:31:44.5519223Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.5519555Z op = silu_mul_quant 2025-05-07T20:31:44.5519818Z if compiled: 2025-05-07T20:31:44.5520075Z op = torch.compile(op) 2025-05-07T20:31:44.5520384Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.5520670Z 2025-05-07T20:31:44.5520874Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.5521043Z 2025-05-07T20:31:44.5521151Z moe/activation_test.py:117: 2025-05-07T20:31:44.5521457Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.5521816Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.5522103Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.5522808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.5523514Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.5524064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.5524879Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.5525552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.5526105Z kernel = self.compile( 2025-05-07T20:31:44.5526654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.5527326Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.5527787Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.5528019Z 2025-05-07T20:31:44.5528237Z self = 2025-05-07T20:31:44.5529322Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.5530735Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb759d00>} 2025-05-07T20:31:44.5532113Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.5533234Z context = 2025-05-07T20:31:44.5533529Z 2025-05-07T20:31:44.5533707Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.5534238Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.5534718Z module_map=module_map) 2025-05-07T20:31:44.5535095Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.5535461Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.5535908Z E ^ 2025-05-07T20:31:44.5536378Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.5536828Z 2025-05-07T20:31:44.5537252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.5537763Z 2025-05-07T20:31:44.5537872Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.5538324Z self=, 2025-05-07T20:31:44.5538807Z T=128, 2025-05-07T20:31:44.5539001Z D=5120, 2025-05-07T20:31:44.5539273Z scale_ub=None, 2025-05-07T20:31:44.5539496Z contiguous=True, 2025-05-07T20:31:44.5539716Z compiled=False, 2025-05-07T20:31:44.5539921Z ) 2025-05-07T20:31:44.6064814Z self = 2025-05-07T20:31:44.6065583Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:44.6065960Z 2025-05-07T20:31:44.6066078Z @given( 2025-05-07T20:31:44.6066389Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.6066812Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.6067132Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.6067468Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.6067804Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.6068104Z ) 2025-05-07T20:31:44.6068457Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.6068903Z def test_silu_mul_quant( 2025-05-07T20:31:44.6069151Z self, 2025-05-07T20:31:44.6069352Z T: int, 2025-05-07T20:31:44.6069560Z D: int, 2025-05-07T20:31:44.6069784Z scale_ub: Optional[float], 2025-05-07T20:31:44.6070062Z contiguous: bool, 2025-05-07T20:31:44.6070484Z compiled: bool, 2025-05-07T20:31:44.6070715Z ) -> None: 2025-05-07T20:31:44.6070939Z torch.manual_seed(2025) 2025-05-07T20:31:44.6071183Z 2025-05-07T20:31:44.6071466Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.6071818Z 2025-05-07T20:31:44.6072016Z x_sign = torch.sign(x) 2025-05-07T20:31:44.6072312Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.6072628Z x = x_sign * x_clamp 2025-05-07T20:31:44.6072878Z x0 = x[:, :D] 2025-05-07T20:31:44.6073110Z x1 = x[:, D:] 2025-05-07T20:31:44.6073329Z 2025-05-07T20:31:44.6073524Z if contiguous: 2025-05-07T20:31:44.6073763Z x0 = x0.contiguous() 2025-05-07T20:31:44.6074028Z x1 = x1.contiguous() 2025-05-07T20:31:44.6074275Z 2025-05-07T20:31:44.6074469Z if scale_ub is not None: 2025-05-07T20:31:44.6074747Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.6075089Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.6075403Z ) 2025-05-07T20:31:44.6075608Z else: 2025-05-07T20:31:44.6075826Z scale_ub_tensor = None 2025-05-07T20:31:44.6076080Z 2025-05-07T20:31:44.6076312Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.6076632Z op = silu_mul_quant 2025-05-07T20:31:44.6076891Z if compiled: 2025-05-07T20:31:44.6077140Z op = torch.compile(op) 2025-05-07T20:31:44.6077555Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.6077841Z 2025-05-07T20:31:44.6078036Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.6078206Z 2025-05-07T20:31:44.6078309Z moe/activation_test.py:117: 2025-05-07T20:31:44.6078610Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.6078940Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.6079224Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.6079916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.6080610Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.6081143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.6081825Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.6082493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.6083022Z kernel = self.compile( 2025-05-07T20:31:44.6083563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.6084215Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.6084612Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.6084848Z 2025-05-07T20:31:44.6085055Z self = 2025-05-07T20:31:44.6086132Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.6087616Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb75ae80>} 2025-05-07T20:31:44.6088957Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.6089978Z context = 2025-05-07T20:31:44.6090265Z 2025-05-07T20:31:44.6090429Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.6091044Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.6091562Z module_map=module_map) 2025-05-07T20:31:44.6091926Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.6092284Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.6092547Z E ^ 2025-05-07T20:31:44.6093018Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.6093465Z 2025-05-07T20:31:44.6093877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.6094389Z 2025-05-07T20:31:44.6094494Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.6094910Z self=, 2025-05-07T20:31:44.6095313Z T=128, 2025-05-07T20:31:44.6095509Z D=7168, 2025-05-07T20:31:44.6095713Z scale_ub=None, 2025-05-07T20:31:44.6095934Z contiguous=True, 2025-05-07T20:31:44.6096157Z compiled=False, 2025-05-07T20:31:44.6096368Z ) 2025-05-07T20:31:44.6096689Z self = 2025-05-07T20:31:44.6097206Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:44.6097476Z 2025-05-07T20:31:44.6097559Z @given( 2025-05-07T20:31:44.6097876Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.6098191Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.6098492Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.6098824Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.6099154Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.6099435Z ) 2025-05-07T20:31:44.6099784Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.6100230Z def test_silu_mul_quant( 2025-05-07T20:31:44.6100474Z self, 2025-05-07T20:31:44.6100667Z T: int, 2025-05-07T20:31:44.6100868Z D: int, 2025-05-07T20:31:44.6101119Z scale_ub: Optional[float], 2025-05-07T20:31:44.6101405Z contiguous: bool, 2025-05-07T20:31:44.6101645Z compiled: bool, 2025-05-07T20:31:44.6101868Z ) -> None: 2025-05-07T20:31:44.6102086Z torch.manual_seed(2025) 2025-05-07T20:31:44.6102331Z 2025-05-07T20:31:44.6102610Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.6102951Z 2025-05-07T20:31:44.6103152Z x_sign = torch.sign(x) 2025-05-07T20:31:44.6103448Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.6103752Z x = x_sign * x_clamp 2025-05-07T20:31:44.6103996Z x0 = x[:, :D] 2025-05-07T20:31:44.6104217Z x1 = x[:, D:] 2025-05-07T20:31:44.6104421Z 2025-05-07T20:31:44.6104616Z if contiguous: 2025-05-07T20:31:44.6104859Z x0 = x0.contiguous() 2025-05-07T20:31:44.6105119Z x1 = x1.contiguous() 2025-05-07T20:31:44.6105352Z 2025-05-07T20:31:44.6105549Z if scale_ub is not None: 2025-05-07T20:31:44.6106153Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.6106486Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.6106798Z ) 2025-05-07T20:31:44.6106993Z else: 2025-05-07T20:31:44.6107208Z scale_ub_tensor = None 2025-05-07T20:31:44.6107461Z 2025-05-07T20:31:44.6107698Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.6108012Z op = silu_mul_quant 2025-05-07T20:31:44.6108264Z if compiled: 2025-05-07T20:31:44.6108515Z op = torch.compile(op) 2025-05-07T20:31:44.6108804Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.6109085Z 2025-05-07T20:31:44.6109281Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.6109619Z 2025-05-07T20:31:44.6109730Z moe/activation_test.py:117: 2025-05-07T20:31:44.6110026Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.6110359Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.6110638Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.6111320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.6112012Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.6112544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.6113223Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.6113880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.6114411Z kernel = self.compile( 2025-05-07T20:31:44.6114956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.6115602Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.6115995Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.6116229Z 2025-05-07T20:31:44.6116435Z self = 2025-05-07T20:31:44.6117626Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.6118983Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb75bec0>} 2025-05-07T20:31:44.6120315Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.6121393Z context = 2025-05-07T20:31:44.6121683Z 2025-05-07T20:31:44.6121852Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.6122376Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.6122842Z module_map=module_map) 2025-05-07T20:31:44.6123207Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.6123562Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.6123822Z E ^ 2025-05-07T20:31:44.6124290Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.6124736Z 2025-05-07T20:31:44.6125156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.6125671Z 2025-05-07T20:31:44.6125783Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.6126188Z self=, 2025-05-07T20:31:44.6126592Z T=2048, 2025-05-07T20:31:44.6126787Z D=7168, 2025-05-07T20:31:44.6126983Z scale_ub=1200.0, 2025-05-07T20:31:44.6127208Z contiguous=True, 2025-05-07T20:31:44.6127433Z compiled=False, 2025-05-07T20:31:44.6127686Z ) 2025-05-07T20:31:44.6786248Z self = 2025-05-07T20:31:44.6786990Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:44.6787384Z 2025-05-07T20:31:44.6787492Z @given( 2025-05-07T20:31:44.6787801Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.6788220Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.6788644Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.6789167Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.6789493Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.6789786Z ) 2025-05-07T20:31:44.6790132Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.6790568Z def test_silu_mul_quant( 2025-05-07T20:31:44.6790810Z self, 2025-05-07T20:31:44.6791013Z T: int, 2025-05-07T20:31:44.6791216Z D: int, 2025-05-07T20:31:44.6791439Z scale_ub: Optional[float], 2025-05-07T20:31:44.6791710Z contiguous: bool, 2025-05-07T20:31:44.6791953Z compiled: bool, 2025-05-07T20:31:44.6792181Z ) -> None: 2025-05-07T20:31:44.6792400Z torch.manual_seed(2025) 2025-05-07T20:31:44.6792646Z 2025-05-07T20:31:44.6792913Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.6794957Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.6796929Z 2025-05-07T20:31:44.6797050Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:44.6797262Z 2025-05-07T20:31:44.6797371Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.6797777Z self=, 2025-05-07T20:31:44.6798184Z T=1, 2025-05-07T20:31:44.6798377Z D=5120, 2025-05-07T20:31:44.6798577Z scale_ub=1200.0, 2025-05-07T20:31:44.6798797Z contiguous=True, 2025-05-07T20:31:44.6799030Z compiled=False, 2025-05-07T20:31:44.6799236Z ) 2025-05-07T20:31:44.6799554Z self = 2025-05-07T20:31:44.6800043Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:44.6800308Z 2025-05-07T20:31:44.6800395Z @given( 2025-05-07T20:31:44.6800624Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.6800940Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.6801255Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.6801581Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.6801904Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.6802193Z ) 2025-05-07T20:31:44.6802548Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.6802991Z def test_silu_mul_quant( 2025-05-07T20:31:44.6803233Z self, 2025-05-07T20:31:44.6803437Z T: int, 2025-05-07T20:31:44.6803637Z D: int, 2025-05-07T20:31:44.6803849Z scale_ub: Optional[float], 2025-05-07T20:31:44.6804124Z contiguous: bool, 2025-05-07T20:31:44.6804367Z compiled: bool, 2025-05-07T20:31:44.6804591Z ) -> None: 2025-05-07T20:31:44.6804807Z torch.manual_seed(2025) 2025-05-07T20:31:44.6805048Z 2025-05-07T20:31:44.6805326Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.6805923Z 2025-05-07T20:31:44.6806133Z x_sign = torch.sign(x) 2025-05-07T20:31:44.6806430Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.6806738Z x = x_sign * x_clamp 2025-05-07T20:31:44.6806984Z x0 = x[:, :D] 2025-05-07T20:31:44.6807203Z x1 = x[:, D:] 2025-05-07T20:31:44.6807411Z 2025-05-07T20:31:44.6807679Z if contiguous: 2025-05-07T20:31:44.6807913Z x0 = x0.contiguous() 2025-05-07T20:31:44.6808166Z x1 = x1.contiguous() 2025-05-07T20:31:44.6808565Z 2025-05-07T20:31:44.6808760Z if scale_ub is not None: 2025-05-07T20:31:44.6809030Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.6809369Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.6809682Z ) 2025-05-07T20:31:44.6809881Z else: 2025-05-07T20:31:44.6810098Z scale_ub_tensor = None 2025-05-07T20:31:44.6810354Z 2025-05-07T20:31:44.6810589Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.6810907Z op = silu_mul_quant 2025-05-07T20:31:44.6811161Z if compiled: 2025-05-07T20:31:44.6811411Z op = torch.compile(op) 2025-05-07T20:31:44.6811703Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.6811985Z 2025-05-07T20:31:44.6812189Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.6812352Z 2025-05-07T20:31:44.6812453Z moe/activation_test.py:117: 2025-05-07T20:31:44.6812751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.6813092Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.6813370Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.6814057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.6814751Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.6815401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.6816080Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.6816741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.6817273Z kernel = self.compile( 2025-05-07T20:31:44.6817813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.6818465Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.6818880Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.6819111Z 2025-05-07T20:31:44.6819316Z self = 2025-05-07T20:31:44.6820398Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.6821757Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb85d4e0>} 2025-05-07T20:31:44.6823095Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.6824121Z context = 2025-05-07T20:31:44.6824410Z 2025-05-07T20:31:44.6824581Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.6825104Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.6825565Z module_map=module_map) 2025-05-07T20:31:44.6825932Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.6826285Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.6826543Z E ^ 2025-05-07T20:31:44.6827008Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.6827453Z 2025-05-07T20:31:44.6827868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.6828375Z 2025-05-07T20:31:44.6828573Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.6828981Z self=, 2025-05-07T20:31:44.6829511Z T=2048, 2025-05-07T20:31:44.6829705Z D=5120, 2025-05-07T20:31:44.6829894Z scale_ub=None, 2025-05-07T20:31:44.6830111Z contiguous=True, 2025-05-07T20:31:44.6830338Z compiled=False, 2025-05-07T20:31:44.6830543Z ) 2025-05-07T20:31:44.6830869Z self = 2025-05-07T20:31:44.6831371Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:44.6831642Z 2025-05-07T20:31:44.6831730Z @given( 2025-05-07T20:31:44.6831966Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.6832280Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.6832588Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.6832913Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.6833257Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.6833554Z ) 2025-05-07T20:31:44.6833904Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.6834346Z def test_silu_mul_quant( 2025-05-07T20:31:44.6834598Z self, 2025-05-07T20:31:44.6834790Z T: int, 2025-05-07T20:31:44.6834997Z D: int, 2025-05-07T20:31:44.6835225Z scale_ub: Optional[float], 2025-05-07T20:31:44.6835493Z contiguous: bool, 2025-05-07T20:31:44.6835857Z compiled: bool, 2025-05-07T20:31:44.6836088Z ) -> None: 2025-05-07T20:31:44.6836315Z torch.manual_seed(2025) 2025-05-07T20:31:44.6836560Z 2025-05-07T20:31:44.6836833Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.6837180Z 2025-05-07T20:31:44.6837374Z > x_sign = torch.sign(x) 2025-05-07T20:31:44.6839339Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.6848568Z 2025-05-07T20:31:44.6848758Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:44.6848989Z 2025-05-07T20:31:44.6849103Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.6849525Z self=, 2025-05-07T20:31:44.6849940Z T=16384, 2025-05-07T20:31:44.6850144Z D=5120, 2025-05-07T20:31:44.6850349Z scale_ub=None, 2025-05-07T20:31:44.6850564Z contiguous=True, 2025-05-07T20:31:44.6850786Z compiled=False, 2025-05-07T20:31:44.6851010Z ) 2025-05-07T20:31:44.7543625Z self = 2025-05-07T20:31:44.7544420Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:44.7544804Z 2025-05-07T20:31:44.7544917Z @given( 2025-05-07T20:31:44.7545223Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.7545567Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.7545884Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.7546208Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.7546536Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.7546821Z ) 2025-05-07T20:31:44.7547164Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.7547605Z def test_silu_mul_quant( 2025-05-07T20:31:44.7547849Z self, 2025-05-07T20:31:44.7548044Z T: int, 2025-05-07T20:31:44.7548894Z D: int, 2025-05-07T20:31:44.7549115Z scale_ub: Optional[float], 2025-05-07T20:31:44.7549388Z contiguous: bool, 2025-05-07T20:31:44.7549621Z compiled: bool, 2025-05-07T20:31:44.7549843Z ) -> None: 2025-05-07T20:31:44.7550058Z torch.manual_seed(2025) 2025-05-07T20:31:44.7550297Z 2025-05-07T20:31:44.7550571Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.7552658Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.7554563Z 2025-05-07T20:31:44.7554687Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:44.7554902Z 2025-05-07T20:31:44.7555017Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.7555432Z self=, 2025-05-07T20:31:44.7555841Z T=4096, 2025-05-07T20:31:44.7556036Z D=5120, 2025-05-07T20:31:44.7556235Z scale_ub=None, 2025-05-07T20:31:44.7556454Z contiguous=True, 2025-05-07T20:31:44.7556802Z compiled=False, 2025-05-07T20:31:44.7557016Z ) 2025-05-07T20:31:44.7557334Z self = 2025-05-07T20:31:44.7557835Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:44.7558107Z 2025-05-07T20:31:44.7558197Z @given( 2025-05-07T20:31:44.7558429Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.7558746Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.7559061Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.7559392Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.7559724Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.7560014Z ) 2025-05-07T20:31:44.7560372Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.7560813Z def test_silu_mul_quant( 2025-05-07T20:31:44.7561063Z self, 2025-05-07T20:31:44.7561268Z T: int, 2025-05-07T20:31:44.7561479Z D: int, 2025-05-07T20:31:44.7561704Z scale_ub: Optional[float], 2025-05-07T20:31:44.7561982Z contiguous: bool, 2025-05-07T20:31:44.7562224Z compiled: bool, 2025-05-07T20:31:44.7562453Z ) -> None: 2025-05-07T20:31:44.7562679Z torch.manual_seed(2025) 2025-05-07T20:31:44.7562924Z 2025-05-07T20:31:44.7563199Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.7565271Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

The next eight Hypothesis examples fail identically: each re-runs the listing shown above and raises torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)), with the same device state every time (GPU 0: 22.07 GiB total capacity, 30.44 MiB free, 22.03 GiB in use by this process, 21.73 GiB allocated by PyTorch, 13.87 MiB reserved by PyTorch but unallocated, and the hint to try PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True if reserved-but-unallocated memory is large). Only the drawn parameters and the requested allocation differ:

Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=False) -> Tried to allocate 40.00 MiB
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> Tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> Tried to allocate 40.00 MiB
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=False) -> Tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=False, compiled=True)  -> Tried to allocate 448.00 MiB
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=False) -> Tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=False) -> Tried to allocate 448.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=False) -> Tried to allocate 448.00 MiB
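The requested sizes line up exactly with the input the test builds: a bfloat16 tensor of shape [T, 2 * D] occupies T * 2D * 2 bytes. A minimal sketch of that arithmetic, checked against the sizes reported above (the helper is illustrative, not part of the test suite):

    # Sketch: confirm the "Tried to allocate" sizes are exactly the bf16 input
    # x = torch.randn([T, 2 * D], dtype=torch.bfloat16) from test_silu_mul_quant.
    BYTES_PER_BF16 = 2

    def input_mib(T: int, D: int) -> float:
        """MiB occupied by a [T, 2 * D] bfloat16 tensor."""
        return T * (2 * D) * BYTES_PER_BF16 / (1024 * 1024)

    assert input_mib(2048, 5120) == 40.0     # "Tried to allocate 40.00 MiB"
    assert input_mib(4096, 5120) == 80.0     # 80.00 MiB
    assert input_mib(2048, 7168) == 56.0     # 56.00 MiB
    assert input_mib(4096, 7168) == 112.0    # 112.00 MiB
    assert input_mib(16384, 7168) == 448.0   # 448.00 MiB

With only 30.44 MiB free on a 22.07 GiB device, even the smallest of these requests has to fail; the errors say more about the device being nearly full than about any single shape.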
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.8582977Z 2025-05-07T20:31:44.8583095Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:44.8583401Z 2025-05-07T20:31:44.8583506Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.8583916Z self=, 2025-05-07T20:31:44.8584316Z T=128, 2025-05-07T20:31:44.8584508Z D=5120, 2025-05-07T20:31:44.8584706Z scale_ub=1200.0, 2025-05-07T20:31:44.8584931Z contiguous=False, 2025-05-07T20:31:44.8585154Z compiled=False, 2025-05-07T20:31:44.8585358Z ) 2025-05-07T20:31:44.9600764Z self = 2025-05-07T20:31:44.9601731Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:44.9602018Z 2025-05-07T20:31:44.9602102Z @given( 2025-05-07T20:31:44.9602344Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.9602654Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.9602972Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.9603310Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.9603651Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.9603943Z ) 2025-05-07T20:31:44.9604295Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.9604737Z def test_silu_mul_quant( 2025-05-07T20:31:44.9604982Z self, 2025-05-07T20:31:44.9605185Z T: int, 2025-05-07T20:31:44.9605387Z D: int, 2025-05-07T20:31:44.9605840Z scale_ub: Optional[float], 2025-05-07T20:31:44.9606293Z contiguous: bool, 2025-05-07T20:31:44.9606547Z compiled: bool, 2025-05-07T20:31:44.9606775Z ) -> None: 2025-05-07T20:31:44.9606999Z torch.manual_seed(2025) 2025-05-07T20:31:44.9607399Z 2025-05-07T20:31:44.9607736Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.9608086Z 2025-05-07T20:31:44.9608294Z x_sign = torch.sign(x) 2025-05-07T20:31:44.9608591Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.9608913Z x = x_sign * x_clamp 2025-05-07T20:31:44.9609164Z x0 = x[:, :D] 2025-05-07T20:31:44.9609396Z x1 = x[:, D:] 2025-05-07T20:31:44.9609608Z 2025-05-07T20:31:44.9609804Z if contiguous: 2025-05-07T20:31:44.9610045Z x0 = x0.contiguous() 2025-05-07T20:31:44.9610305Z x1 = x1.contiguous() 2025-05-07T20:31:44.9610555Z 2025-05-07T20:31:44.9610763Z if scale_ub is not None: 2025-05-07T20:31:44.9611039Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.9611403Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.9611719Z ) 2025-05-07T20:31:44.9611921Z else: 2025-05-07T20:31:44.9612139Z scale_ub_tensor = None 2025-05-07T20:31:44.9612397Z 2025-05-07T20:31:44.9612638Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.9612959Z op = silu_mul_quant 2025-05-07T20:31:44.9613223Z if compiled: 2025-05-07T20:31:44.9621141Z op = torch.compile(op) 2025-05-07T20:31:44.9621497Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.9621778Z 2025-05-07T20:31:44.9621985Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.9622156Z 2025-05-07T20:31:44.9622267Z moe/activation_test.py:117: 2025-05-07T20:31:44.9622566Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.9622906Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.9623202Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.9623902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.9624607Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.9625159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.9625844Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.9626699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.9627235Z kernel = self.compile( 2025-05-07T20:31:44.9627779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.9628431Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.9628838Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.9629067Z 2025-05-07T20:31:44.9629281Z self = 2025-05-07T20:31:44.9630362Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.9631721Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb368220>} 2025-05-07T20:31:44.9633067Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.9634087Z context = 2025-05-07T20:31:44.9634371Z 2025-05-07T20:31:44.9634628Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.9635142Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.9635608Z module_map=module_map) 2025-05-07T20:31:44.9635978Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.9636332Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.9636590Z E ^ 2025-05-07T20:31:44.9637060Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.9637506Z 2025-05-07T20:31:44.9637926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.9638431Z 2025-05-07T20:31:44.9638540Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.9638947Z self=, 2025-05-07T20:31:44.9639361Z T=2048, 2025-05-07T20:31:44.9639558Z D=7168, 2025-05-07T20:31:44.9639752Z scale_ub=None, 2025-05-07T20:31:44.9639972Z contiguous=False, 2025-05-07T20:31:44.9640199Z compiled=False, 2025-05-07T20:31:44.9640400Z ) 2025-05-07T20:31:44.9640720Z self = 2025-05-07T20:31:44.9641209Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:44.9641485Z 2025-05-07T20:31:44.9641566Z @given( 2025-05-07T20:31:44.9641795Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.9642107Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.9642415Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.9642740Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.9643067Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.9643353Z ) 2025-05-07T20:31:44.9643700Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.9644137Z def test_silu_mul_quant( 2025-05-07T20:31:44.9644380Z self, 2025-05-07T20:31:44.9644571Z T: int, 2025-05-07T20:31:44.9644773Z D: int, 2025-05-07T20:31:44.9644995Z scale_ub: Optional[float], 2025-05-07T20:31:44.9645259Z contiguous: bool, 2025-05-07T20:31:44.9645503Z compiled: bool, 2025-05-07T20:31:44.9645732Z ) -> None: 2025-05-07T20:31:44.9646038Z torch.manual_seed(2025) 2025-05-07T20:31:44.9646283Z 2025-05-07T20:31:44.9646561Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.9648710Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
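The allocator's hint at the end of that message is applied by setting PYTORCH_CUDA_ALLOC_CONF before CUDA is initialized. Whether it would help here is doubtful: only 13.87 MiB (later 5.24 MiB) is reserved-but-unallocated, which points to genuine exhaustion rather than fragmentation. Still, a sketch of how the setting would be applied, assuming a fresh process:

    # Sketch: the variable is read when the CUDA caching allocator initializes,
    # so it must be set before the process makes its first CUDA allocation.
    import os

    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after the environment variable is in place

    _ = torch.randn(8, device="cuda")  # first allocation; the config now applies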
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.9650554Z 2025-05-07T20:31:44.9650682Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:44.9650895Z 2025-05-07T20:31:44.9651000Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.9651417Z self=, 2025-05-07T20:31:44.9651832Z T=128, 2025-05-07T20:31:44.9652018Z D=7168, 2025-05-07T20:31:44.9652223Z scale_ub=1200.0, 2025-05-07T20:31:44.9652444Z contiguous=True, 2025-05-07T20:31:44.9652662Z compiled=True, 2025-05-07T20:31:44.9652872Z ) 2025-05-07T20:31:44.9952112Z self = 2025-05-07T20:31:44.9952746Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:44.9953322Z 2025-05-07T20:31:44.9953445Z @given( 2025-05-07T20:31:44.9953752Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.9954124Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.9954433Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.9954765Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.9955095Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.9955390Z ) 2025-05-07T20:31:44.9955742Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.9956189Z def test_silu_mul_quant( 2025-05-07T20:31:44.9956438Z self, 2025-05-07T20:31:44.9956639Z T: int, 2025-05-07T20:31:44.9956840Z D: int, 2025-05-07T20:31:44.9957067Z scale_ub: Optional[float], 2025-05-07T20:31:44.9957336Z contiguous: bool, 2025-05-07T20:31:44.9957576Z compiled: bool, 2025-05-07T20:31:44.9957803Z ) -> None: 2025-05-07T20:31:44.9958029Z torch.manual_seed(2025) 2025-05-07T20:31:44.9958269Z 2025-05-07T20:31:44.9958543Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.9958885Z 2025-05-07T20:31:44.9959076Z x_sign = torch.sign(x) 2025-05-07T20:31:44.9959366Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.9959675Z x = x_sign * x_clamp 2025-05-07T20:31:44.9959911Z x0 = x[:, :D] 2025-05-07T20:31:44.9960138Z x1 = x[:, D:] 2025-05-07T20:31:44.9960346Z 2025-05-07T20:31:44.9960537Z if contiguous: 2025-05-07T20:31:44.9960773Z x0 = x0.contiguous() 2025-05-07T20:31:44.9961060Z x1 = x1.contiguous() 2025-05-07T20:31:44.9961318Z 2025-05-07T20:31:44.9961518Z if scale_ub is not None: 2025-05-07T20:31:44.9961793Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.9962134Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.9962448Z ) 2025-05-07T20:31:44.9962639Z else: 2025-05-07T20:31:44.9962856Z scale_ub_tensor = None 2025-05-07T20:31:44.9963109Z 2025-05-07T20:31:44.9963340Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.9963651Z op = silu_mul_quant 2025-05-07T20:31:44.9963910Z if compiled: 2025-05-07T20:31:44.9964153Z op = torch.compile(op) 2025-05-07T20:31:44.9964448Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.9964860Z 2025-05-07T20:31:44.9965050Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.9965218Z 2025-05-07T20:31:44.9965318Z moe/activation_test.py:117: 2025-05-07T20:31:44.9965610Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.9965938Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.9966223Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.9966783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:44.9967348Z return fn(*args, **kwargs) 
2025-05-07T20:31:44.9968118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.9968810Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.9969345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.9970030Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.9970686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.9971214Z kernel = self.compile( 2025-05-07T20:31:44.9971751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.9972401Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.9972880Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.9973116Z 2025-05-07T20:31:44.9973321Z self = 2025-05-07T20:31:44.9974397Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.9975762Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb368860>} 2025-05-07T20:31:44.9977094Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.9978114Z context = 2025-05-07T20:31:44.9978400Z 2025-05-07T20:31:44.9978569Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.9979088Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.9979550Z module_map=module_map) 2025-05-07T20:31:44.9979915Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.9980273Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.9980538Z E ^ 2025-05-07T20:31:44.9981002Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.9981446Z 2025-05-07T20:31:44.9981863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.9982369Z 2025-05-07T20:31:44.9982480Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.9982889Z self=, 2025-05-07T20:31:44.9983286Z T=128, 2025-05-07T20:31:44.9983477Z D=7168, 2025-05-07T20:31:44.9983666Z scale_ub=1200.0, 2025-05-07T20:31:44.9983887Z contiguous=True, 2025-05-07T20:31:44.9984108Z compiled=False, 2025-05-07T20:31:44.9984309Z ) 2025-05-07T20:31:44.9984631Z self = 2025-05-07T20:31:44.9985120Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:44.9985481Z 2025-05-07T20:31:44.9985566Z @given( 2025-05-07T20:31:44.9985792Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.9986104Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.9986411Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.9986762Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.9987083Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.9987379Z ) 2025-05-07T20:31:44.9987725Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.9988159Z def test_silu_mul_quant( 2025-05-07T20:31:44.9988406Z self, 2025-05-07T20:31:44.9988604Z T: int, 2025-05-07T20:31:44.9988801Z D: int, 2025-05-07T20:31:44.9989021Z scale_ub: Optional[float], 2025-05-07T20:31:44.9989295Z contiguous: bool, 2025-05-07T20:31:44.9989533Z compiled: bool, 2025-05-07T20:31:44.9989766Z ) -> None: 2025-05-07T20:31:44.9989980Z torch.manual_seed(2025) 2025-05-07T20:31:44.9990221Z 2025-05-07T20:31:44.9990491Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.9990842Z 2025-05-07T20:31:44.9991071Z x_sign = torch.sign(x) 2025-05-07T20:31:44.9991368Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.9993466Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.9995310Z 2025-05-07T20:31:44.9995428Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:44.9995638Z 2025-05-07T20:31:44.9995745Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.9996151Z self=, 2025-05-07T20:31:44.9996550Z T=128, 2025-05-07T20:31:44.9996746Z D=5120, 2025-05-07T20:31:44.9996941Z scale_ub=1200.0, 2025-05-07T20:31:44.9997161Z contiguous=True, 2025-05-07T20:31:44.9997386Z compiled=True, 2025-05-07T20:31:44.9997600Z ) 2025-05-07T20:31:44.9997913Z self = 2025-05-07T20:31:44.9998401Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:44.9998665Z 2025-05-07T20:31:44.9998754Z @given( 2025-05-07T20:31:44.9998978Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.9999287Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.9999599Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.9999924Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.0000252Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.0000542Z ) 2025-05-07T20:31:45.0000888Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.0001339Z def test_silu_mul_quant( 2025-05-07T20:31:45.0001623Z self, 2025-05-07T20:31:45.0001819Z T: int, 2025-05-07T20:31:45.0002017Z D: int, 2025-05-07T20:31:45.0002235Z scale_ub: Optional[float], 2025-05-07T20:31:45.0002507Z contiguous: bool, 2025-05-07T20:31:45.0002742Z compiled: bool, 2025-05-07T20:31:45.0002965Z ) -> None: 2025-05-07T20:31:45.0003183Z torch.manual_seed(2025) 2025-05-07T20:31:45.0003425Z 2025-05-07T20:31:45.0003695Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.0004037Z 2025-05-07T20:31:45.0004316Z > x_sign = torch.sign(x) 2025-05-07T20:31:45.0006554Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.0008502Z 2025-05-07T20:31:45.0008625Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:45.0008834Z 2025-05-07T20:31:45.0008935Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.0009340Z self=, 2025-05-07T20:31:45.0009734Z T=128, 2025-05-07T20:31:45.0009928Z D=7168, 2025-05-07T20:31:45.0010128Z scale_ub=None, 2025-05-07T20:31:45.0010336Z contiguous=True, 2025-05-07T20:31:45.0010561Z compiled=True, 2025-05-07T20:31:45.0010766Z ) 2025-05-07T20:31:45.4890505Z self = 2025-05-07T20:31:45.4891259Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.4891533Z 2025-05-07T20:31:45.4891618Z @given( 2025-05-07T20:31:45.4891853Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4892333Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4892655Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4892988Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4893314Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4893604Z ) 2025-05-07T20:31:45.4893958Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4894409Z def test_silu_mul_quant( 2025-05-07T20:31:45.4894654Z self, 2025-05-07T20:31:45.4894853Z T: int, 2025-05-07T20:31:45.4895055Z D: int, 2025-05-07T20:31:45.4895275Z scale_ub: Optional[float], 2025-05-07T20:31:45.4895553Z contiguous: bool, 2025-05-07T20:31:45.4895800Z compiled: bool, 2025-05-07T20:31:45.4896023Z ) -> None: 2025-05-07T20:31:45.4896242Z torch.manual_seed(2025) 2025-05-07T20:31:45.4896489Z 2025-05-07T20:31:45.4896766Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4898811Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
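Free memory shrinks as the run proceeds: the earlier examples saw 30.44 MiB free with 21.73 GiB allocated by PyTorch, while these last ones see 8.44 MiB free and 21.77 GiB allocated, and even T=128 examples now fail. Memory is evidently accumulating across Hypothesis examples rather than each example failing in isolation. A hedged sketch of a per-example cleanup one could try; it only returns unreferenced cached blocks, so it will not help if live objects (for instance torch.compile artifacts) are what is pinning the memory:

    import gc

    import torch

    def release_cuda_memory() -> None:
        """Best-effort cleanup between examples (a sketch, not in the test)."""
        gc.collect()              # drop tensors kept alive only by reference cycles
        torch.cuda.empty_cache()  # return the caching allocator's unused blocks

    # e.g. at the top of test_silu_mul_quant, before the large allocation:
    #     release_cuda_memory()
    #     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)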
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.4900671Z 2025-05-07T20:31:45.4900793Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.4901012Z 2025-05-07T20:31:45.4956906Z FAILED 2025-05-07T20:31:45.4957243Z 2025-05-07T20:31:45.4957618Z =================================== FAILURES =================================== 2025-05-07T20:31:45.4958845Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:31:45.4960024Z + Exception Group Traceback (most recent call last): 2025-05-07T20:31:45.4961520Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 57, in testPartExecutor 2025-05-07T20:31:45.4962291Z | yield 2025-05-07T20:31:45.4962883Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 623, in run 2025-05-07T20:31:45.4963598Z | self._callTestMethod(testMethod) 2025-05-07T20:31:45.4964558Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 579, in _callTestMethod 2025-05-07T20:31:45.4965310Z | if method() is not None: 2025-05-07T20:31:45.4965646Z | ^^^^^^^^ 2025-05-07T20:31:45.4966513Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:31:45.4967653Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4968068Z | ^^^^^^^ 2025-05-07T20:31:45.4968831Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:31:45.4969693Z | raise the_error_hypothesis_found 2025-05-07T20:31:45.4970277Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:31:45.4970850Z +-+---------------- 1 ---------------- 2025-05-07T20:31:45.4971260Z | Traceback (most recent call last): 2025-05-07T20:31:45.4972220Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:45.4973277Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4973775Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.4976668Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.4978656Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:45.4979092Z | self=, 2025-05-07T20:31:45.4979502Z | T=128, 2025-05-07T20:31:45.4979706Z | D=7168, 2025-05-07T20:31:45.4979913Z | scale_ub=1200.0, 2025-05-07T20:31:45.4980161Z | contiguous=True, 2025-05-07T20:31:45.4980408Z | compiled=False, 2025-05-07T20:31:45.4980626Z | ) 2025-05-07T20:31:45.4980818Z | 2025-05-07T20:31:45.4981343Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case 2025-05-07T20:31:45.4981941Z +---------------- 2 ---------------- 2025-05-07T20:31:45.4982232Z | Traceback (most recent call last): 2025-05-07T20:31:45.4982941Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:45.4983721Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4984095Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.4986077Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.4988039Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:45.4988477Z | self=, 2025-05-07T20:31:45.4988967Z | T=128, 2025-05-07T20:31:45.4989164Z | D=7168, 2025-05-07T20:31:45.4989375Z | scale_ub=None, 2025-05-07T20:31:45.4989612Z | contiguous=True, 2025-05-07T20:31:45.4989847Z | compiled=True, 2025-05-07T20:31:45.4990070Z | ) 2025-05-07T20:31:45.4990251Z | 2025-05-07T20:31:45.4990788Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:31:45.4991419Z +---------------- 3 ---------------- 2025-05-07T20:31:45.4991705Z | Traceback (most recent call last): 2025-05-07T20:31:45.4992405Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:45.4993649Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4994029Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.4996410Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.4999170Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:45.4999765Z | self=, 2025-05-07T20:31:45.5000320Z | T=128, 2025-05-07T20:31:45.5000588Z | D=5120, 2025-05-07T20:31:45.5000874Z | scale_ub=1200.0, 2025-05-07T20:31:45.5001192Z | contiguous=True, 2025-05-07T20:31:45.5001525Z | compiled=True, 2025-05-07T20:31:45.5001834Z | ) 2025-05-07T20:31:45.5002076Z | 2025-05-07T20:31:45.5002794Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:31:45.5003620Z +---------------- 4 ---------------- 2025-05-07T20:31:45.5004007Z | Traceback (most recent call last): 2025-05-07T20:31:45.5005000Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:31:45.5006189Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.5006592Z | ^^^^^^^^ 2025-05-07T20:31:45.5007467Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:31:45.5025368Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5025888Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.5026995Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:31:45.5028074Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.5028953Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:31:45.5029980Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5030598Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.5031470Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:31:45.5032539Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5033371Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.5034049Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:31:45.5034849Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5035326Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.5035972Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:31:45.5036668Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.5037050Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.5037859Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:31:45.5038727Z | fn() 2025-05-07T20:31:45.5039539Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:31:45.5040498Z | self.fn.run( 2025-05-07T20:31:45.5041100Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:31:45.5042161Z | kernel = self.compile( 2025-05-07T20:31:45.5042557Z | ^^^^^^^^^^^^^ 2025-05-07T20:31:45.5043417Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:31:45.5044993Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5045560Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.5046528Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:45.5047797Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5048472Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.5048976Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5049475Z | def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.5049846Z | ^ 2025-05-07T20:31:45.5050490Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5051300Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:45.5051869Z | # The test always failed when commented parts were varied together. 2025-05-07T20:31:45.5052599Z | self=, 2025-05-07T20:31:45.5053217Z | T=1, # or any other generated value 2025-05-07T20:31:45.5053651Z | D=5120, # or any other generated value 2025-05-07T20:31:45.5054130Z | scale_ub=None, # or any other generated value 2025-05-07T20:31:45.5054638Z | contiguous=True, # or any other generated value 2025-05-07T20:31:45.5055162Z | compiled=True, # or any other generated value 2025-05-07T20:31:45.5055594Z | ) 2025-05-07T20:31:45.5055851Z | 2025-05-07T20:31:45.5056603Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:31:45.5057464Z +------------------------------------ 2025-05-07T20:31:45.5057954Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:31:45.5058441Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5058999Z self=, 2025-05-07T20:31:45.5059688Z T=1, 2025-05-07T20:31:45.5059952Z D=5120, 2025-05-07T20:31:45.5060212Z scale_ub=None, 2025-05-07T20:31:45.5060510Z contiguous=True, 2025-05-07T20:31:45.5060853Z compiled=True, 2025-05-07T20:31:45.5061172Z ) 2025-05-07T20:31:45.5061632Z self = 2025-05-07T20:31:45.5062328Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.5062696Z 2025-05-07T20:31:45.5062809Z @given( 2025-05-07T20:31:45.5063151Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5063600Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5064030Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5064493Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5064956Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5065352Z ) 2025-05-07T20:31:45.5065845Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5066472Z def test_silu_mul_quant( 2025-05-07T20:31:45.5066810Z self, 2025-05-07T20:31:45.5067094Z T: int, 2025-05-07T20:31:45.5067382Z D: int, 2025-05-07T20:31:45.5067696Z scale_ub: Optional[float], 2025-05-07T20:31:45.5068068Z contiguous: 
bool, 2025-05-07T20:31:45.5068406Z compiled: bool, 2025-05-07T20:31:45.5068720Z ) -> None: 2025-05-07T20:31:45.5069017Z torch.manual_seed(2025) 2025-05-07T20:31:45.5069458Z 2025-05-07T20:31:45.5069845Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5070320Z 2025-05-07T20:31:45.5070603Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5071015Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5071423Z x = x_sign * x_clamp 2025-05-07T20:31:45.5071741Z x0 = x[:, :D] 2025-05-07T20:31:45.5072049Z x1 = x[:, D:] 2025-05-07T20:31:45.5072348Z 2025-05-07T20:31:45.5072627Z if contiguous: 2025-05-07T20:31:45.5072960Z x0 = x0.contiguous() 2025-05-07T20:31:45.5073316Z x1 = x1.contiguous() 2025-05-07T20:31:45.5073653Z 2025-05-07T20:31:45.5073933Z if scale_ub is not None: 2025-05-07T20:31:45.5074324Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5074804Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5075273Z ) 2025-05-07T20:31:45.5075576Z else: 2025-05-07T20:31:45.5075907Z scale_ub_tensor = None 2025-05-07T20:31:45.5076264Z 2025-05-07T20:31:45.5076583Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5077001Z op = silu_mul_quant 2025-05-07T20:31:45.5077353Z if compiled: 2025-05-07T20:31:45.5077707Z op = torch.compile(op) 2025-05-07T20:31:45.5078104Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5078493Z 2025-05-07T20:31:45.5078779Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.5079163Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.5079570Z 2025-05-07T20:31:45.5079897Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5080365Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.5080781Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.5081258Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.5081787Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5082244Z 2025-05-07T20:31:45.5082524Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.5082820Z 2025-05-07T20:31:45.5082972Z moe/activation_test.py:126: 2025-05-07T20:31:45.5083426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5083912Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.5084375Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5085591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.5086661Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.5087423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5088500Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5089469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.5090418Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5091440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.5092486Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5093521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.5094380Z return 
self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.5095196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.5095885Z fn() 2025-05-07T20:31:45.5096627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.5097386Z self.fn.run( 2025-05-07T20:31:45.5097985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5098682Z kernel = self.compile( 2025-05-07T20:31:45.5099350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5100202Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5100696Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5100976Z 2025-05-07T20:31:45.5101233Z self = 2025-05-07T20:31:45.5102707Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5104662Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51f3937380>} 2025-05-07T20:31:45.5106791Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5108236Z context = 2025-05-07T20:31:45.5108641Z 2025-05-07T20:31:45.5108882Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5109611Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5110268Z module_map=module_map) 2025-05-07T20:31:45.5110805Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5111320Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.5111692Z E ^ 2025-05-07T20:31:45.5112362Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5112988Z 2025-05-07T20:31:45.5113558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5114244Z 2025-05-07T20:31:45.5114384Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5115158Z self=, 2025-05-07T20:31:45.5115721Z T=2048, 2025-05-07T20:31:45.5115986Z D=5120, 2025-05-07T20:31:45.5116281Z scale_ub=1200.0, 2025-05-07T20:31:45.5116625Z contiguous=True, 2025-05-07T20:31:45.5116944Z compiled=False, 2025-05-07T20:31:45.5117227Z ) 2025-05-07T20:31:45.5117653Z self = 2025-05-07T20:31:45.5118346Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.5118733Z 2025-05-07T20:31:45.5118847Z @given( 2025-05-07T20:31:45.5119189Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5119617Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5120041Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5120850Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5121343Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5121717Z ) 2025-05-07T20:31:45.5122427Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5123039Z def test_silu_mul_quant( 2025-05-07T20:31:45.5123340Z self, 2025-05-07T20:31:45.5123589Z T: int, 2025-05-07T20:31:45.5123840Z D: int, 2025-05-07T20:31:45.5124109Z scale_ub: Optional[float], 2025-05-07T20:31:45.5124456Z contiguous: bool, 2025-05-07T20:31:45.5124766Z compiled: bool, 2025-05-07T20:31:45.5125214Z ) -> None: 2025-05-07T20:31:45.5125534Z torch.manual_seed(2025) 2025-05-07T20:31:45.5125885Z 2025-05-07T20:31:45.5126289Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5126776Z 2025-05-07T20:31:45.5127074Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5127603Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5128054Z x = x_sign * x_clamp 2025-05-07T20:31:45.5128407Z x0 = x[:, :D] 2025-05-07T20:31:45.5128730Z x1 = x[:, D:] 2025-05-07T20:31:45.5129049Z 2025-05-07T20:31:45.5129317Z if contiguous: 2025-05-07T20:31:45.5129652Z x0 = x0.contiguous() 2025-05-07T20:31:45.5130026Z x1 = x1.contiguous() 2025-05-07T20:31:45.5130376Z 2025-05-07T20:31:45.5130641Z if scale_ub is not None: 2025-05-07T20:31:45.5130981Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5131448Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5131833Z ) 2025-05-07T20:31:45.5132073Z else: 2025-05-07T20:31:45.5132333Z scale_ub_tensor = None 2025-05-07T20:31:45.5132647Z 2025-05-07T20:31:45.5132934Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5133316Z op = silu_mul_quant 2025-05-07T20:31:45.5133625Z if compiled: 2025-05-07T20:31:45.5133933Z op = torch.compile(op) 2025-05-07T20:31:45.5134306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5134645Z 2025-05-07T20:31:45.5134896Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.5135095Z 2025-05-07T20:31:45.5135223Z moe/activation_test.py:117: 2025-05-07T20:31:45.5135580Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5135996Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.5136349Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5137280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.5138192Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.5138858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5139701Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5140514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5144114Z kernel = self.compile( 2025-05-07T20:31:45.5144797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5145612Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5146096Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5146384Z 2025-05-07T20:31:45.5146638Z self = 2025-05-07T20:31:45.5148051Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5149947Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51f3756660>} 2025-05-07T20:31:45.5151796Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5153200Z context = 2025-05-07T20:31:45.5153600Z 2025-05-07T20:31:45.5153921Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5154640Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5155273Z module_map=module_map) 2025-05-07T20:31:45.5155770Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5156246Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.5156595Z E ^ 2025-05-07T20:31:45.5157241Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5157883Z 2025-05-07T20:31:45.5158466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5159178Z 2025-05-07T20:31:45.5159336Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5159904Z self=, 2025-05-07T20:31:45.5160466Z T=2048, 2025-05-07T20:31:45.5160741Z D=5120, 2025-05-07T20:31:45.5161008Z scale_ub=1200.0, 2025-05-07T20:31:45.5161323Z contiguous=True, 2025-05-07T20:31:45.5161632Z compiled=True, 2025-05-07T20:31:45.5161923Z ) 2025-05-07T20:31:45.5162364Z self = 2025-05-07T20:31:45.5163036Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:45.5163408Z 2025-05-07T20:31:45.5163523Z @given( 2025-05-07T20:31:45.5163844Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5164270Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5164689Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5165156Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5165599Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5165992Z ) 2025-05-07T20:31:45.5166468Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5167112Z def test_silu_mul_quant( 2025-05-07T20:31:45.5167443Z self, 2025-05-07T20:31:45.5167815Z T: int, 2025-05-07T20:31:45.5168091Z D: int, 2025-05-07T20:31:45.5168400Z scale_ub: Optional[float], 2025-05-07T20:31:45.5168778Z contiguous: bool, 2025-05-07T20:31:45.5169110Z compiled: bool, 2025-05-07T20:31:45.5169426Z ) -> None: 2025-05-07T20:31:45.5169727Z torch.manual_seed(2025) 2025-05-07T20:31:45.5170060Z 2025-05-07T20:31:45.5170541Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5171065Z 2025-05-07T20:31:45.5171328Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5171731Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5172168Z x = x_sign * x_clamp 2025-05-07T20:31:45.5172490Z x0 = x[:, :D] 2025-05-07T20:31:45.5172793Z x1 = x[:, D:] 2025-05-07T20:31:45.5173078Z 2025-05-07T20:31:45.5173328Z if contiguous: 2025-05-07T20:31:45.5173658Z x0 = x0.contiguous() 2025-05-07T20:31:45.5174029Z x1 = x1.contiguous() 2025-05-07T20:31:45.5174368Z 2025-05-07T20:31:45.5174638Z if scale_ub is not None: 2025-05-07T20:31:45.5175011Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5175468Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5175889Z ) 2025-05-07T20:31:45.5176158Z else: 2025-05-07T20:31:45.5176462Z scale_ub_tensor = None 2025-05-07T20:31:45.5176809Z 2025-05-07T20:31:45.5177126Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5177562Z op = silu_mul_quant 2025-05-07T20:31:45.5177903Z if compiled: 2025-05-07T20:31:45.5178253Z op = torch.compile(op) 2025-05-07T20:31:45.5178664Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5179039Z 2025-05-07T20:31:45.5179309Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.5179796Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.5180202Z 2025-05-07T20:31:45.5180542Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5181007Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.5181399Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.5181818Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.5182296Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5182714Z 2025-05-07T20:31:45.5183000Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.5183283Z 2025-05-07T20:31:45.5183429Z moe/activation_test.py:126: 2025-05-07T20:31:45.5183858Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5184329Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.5184793Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5185872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.5186867Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.5187583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5188501Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5189417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.5190405Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5191408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.5192395Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5193373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.5194212Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.5195015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.5195705Z fn() 2025-05-07T20:31:45.5196372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.5197240Z self.fn.run( 2025-05-07T20:31:45.5197846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5198553Z kernel = self.compile( 2025-05-07T20:31:45.5199283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5200149Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5200693Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5200996Z 2025-05-07T20:31:45.5201274Z self = 2025-05-07T20:31:45.5202731Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5204559Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f51f38cbce0>} 2025-05-07T20:31:45.5206817Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5208555Z context = 2025-05-07T20:31:45.5208972Z 2025-05-07T20:31:45.5209214Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5209898Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5210534Z module_map=module_map) 2025-05-07T20:31:45.5211069Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5211542Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.5211928Z E ^ 2025-05-07T20:31:45.5212545Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5213146Z 2025-05-07T20:31:45.5213713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
The remaining Hypothesis examples fail with this same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The pattern is consistent: examples with compiled=False fail inside fn() while Triton compiles _fbgemm_silu_mul_quant (moe/activation.py:80); examples with compiled=True return from the torch.compile'd fn() and fail instead inside ref_fn() while Triton compiles _kernel_quantize_fp8_row (fp8_gemm.py:2370). Their full tracebacks match the two shown above; only the sampled parameters differ:
2025-05-07T20:31:45.5214539Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> fails in fn() (_fbgemm_silu_mul_quant)
2025-05-07T20:31:45.5270794Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() (_kernel_quantize_fp8_row)
2025-05-07T20:31:45.5314582Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> fails in fn() (_fbgemm_silu_mul_quant)
2025-05-07T20:31:45.5346209Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fails in fn() (_fbgemm_silu_mul_quant)
2025-05-07T20:31:45.5377326Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> fails in ref_fn() (_kernel_quantize_fp8_row)
2025-05-07T20:31:45.5425359Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fails in fn() (_fbgemm_silu_mul_quant)
2025-05-07T20:31:45.5445977Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> fails in fn() (_fbgemm_silu_mul_quant)
2025-05-07T20:31:45.5459106Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() (_kernel_quantize_fp8_row)
One more sampled example follows in full after the two sketches below.
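Why every example fails: the Triton backend on this runner's GPU rejects the fp8e4nv (FP8 E4M3) element type and offers only fp8e4b15 and fp8e5. Below is a minimal sketch of a guard that would skip these tests on such hardware instead of erroring, assuming the usual rule that Triton enables fp8e4nv only on NVIDIA compute capability 8.9 or newer (the A10G in a g5.4xlarge runner reports 8.6); the helper and test names here are hypothetical, not part of FBGEMM's API:

import pytest
import torch

def _fp8e4nv_supported() -> bool:
    # Hypothetical helper: Triton's fp8e4nv (E4M3) codegen is assumed to
    # require an NVIDIA GPU with compute capability >= 8.9 (Ada/Hopper).
    # On older parts such as the A10G, capability (8, 6), compilation raises
    # the "type fp8e4nv not supported in this architecture" error seen above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@pytest.mark.skipif(
    not _fp8e4nv_supported(),
    reason="FP8 E4M3 (fp8e4nv) not supported on this GPU architecture",
)
def test_silu_mul_quant_guarded() -> None:
    ...  # body as in test_silu_mul_quant above

With such a marker, a run on a pre-Ada GPU would report the fp8 cases as skipped rather than failing the job.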
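For context on what the two failing kernels compute: silu_mul_quant fuses y = silu(x0) * x1 with row-wise FP8 quantization, and the test's ref_fn performs the same two steps unfused via triton_quantize_fp8_row. The following rough pure-PyTorch sketch of row-wise E4M3 quantization follows the common recipe (largest finite E4M3 value 448.0, optional clamp of the row maximum to scale_ub); it illustrates the idea and is not FBGEMM's exact algorithm:

from typing import Optional, Tuple

import torch

E4M3_MAX = 448.0  # largest finite float8_e4m3fn value

def rowwise_fp8_quant_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per row, pick a scale so the largest |value| lands at E4M3_MAX,
    # optionally clamping the observed row max to scale_ub first
    # (as the scale_ub=1200.0 examples above do).
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max / E4M3_MAX
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale

Dequantization then matches the test's check: y is approximately y_fp8.to(torch.float32) * scale[:, None].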
at 0x7f51cbd077e0>} 2025-05-07T20:31:45.5474069Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5474339Z context = 2025-05-07T20:31:45.5474350Z 2025-05-07T20:31:45.5474518Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5474786Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5474903Z module_map=module_map) 2025-05-07T20:31:45.5475068Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5475174Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.5475266Z E ^ 2025-05-07T20:31:45.5475624Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5475629Z 2025-05-07T20:31:45.5476050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
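For context, ref_fn in the trace above computes y = x0 * sigmoid(x0) * x1 (SiLU-mul) in fp32 and then quantizes per row with triton_quantize_fp8_row. A minimal PyTorch sketch of that row-wise scheme, consistent with the test's dequantization y_fp8.to(torch.float32) * y_scale[:, None], assuming torch.float8_e4m3fn (max 448.0) and treating scale_ub as a cap on the per-row max; this is an illustration, not FBGEMM's implementation:

    import torch

    FP8_DTYPE = torch.float8_e4m3fn
    FP8_MAX = torch.finfo(FP8_DTYPE).max  # 448.0 for E4M3

    def quantize_fp8_row_sketch(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        # One dequantization scale per row, chosen so the row's max |value| maps to FP8_MAX.
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            # Assumed semantics: clamp the per-row max before deriving the scale.
            row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
        y_scale = row_max / FP8_MAX + 1e-12
        y_fp8 = (y.to(torch.float32) / y_scale[:, None]).to(FP8_DTYPE)
        return y_fp8, y_scale

Multiplying y_fp8.to(torch.float32) by y_scale[:, None], as the test does, recovers y up to fp8 rounding error.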
2025-05-07T20:31:45.5481647Z op = torch.compile(op) 2025-05-07T20:31:45.5481862Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5481953Z 2025-05-07T20:31:45.5482049Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.5482174Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.5482255Z 2025-05-07T20:31:45.5482393Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5482502Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.5482603Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.5482735Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.5482887Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5482964Z 2025-05-07T20:31:45.5483065Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.5483070Z 2025-05-07T20:31:45.5483176Z moe/activation_test.py:126: 2025-05-07T20:31:45.5483309Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5483423Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.5483564Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5484125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.5484237Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.5484598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5484829Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5485199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.5485456Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5485859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.5486118Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5486494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.5486667Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.5487009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.5487178Z fn() 2025-05-07T20:31:45.5487635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.5487723Z self.fn.run( 2025-05-07T20:31:45.5488065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5488164Z kernel = self.compile( 2025-05-07T20:31:45.5488552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5488732Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5488862Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5488867Z 2025-05-07T20:31:45.5489078Z self = 2025-05-07T20:31:45.5489855Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:31:45.5490364Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51cbc36c00>} 2025-05-07T20:31:45.5491195Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5491391Z context = 2025-05-07T20:31:45.5491396Z 2025-05-07T20:31:45.5491568Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5491834Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5491944Z module_map=module_map) 2025-05-07T20:31:45.5492118Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5492223Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.5492312Z E ^ 2025-05-07T20:31:45.5492667Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5492672Z 2025-05-07T20:31:45.5493090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5493100Z 2025-05-07T20:31:45.5493211Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5493435Z self=, 2025-05-07T20:31:45.5493518Z T=128, 2025-05-07T20:31:45.5493599Z D=5120, 2025-05-07T20:31:45.5493689Z scale_ub=None, 2025-05-07T20:31:45.5493782Z contiguous=True, 2025-05-07T20:31:45.5493869Z compiled=True, 2025-05-07T20:31:45.5493947Z ) 2025-05-07T20:31:45.5494173Z self = 2025-05-07T20:31:45.5494349Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.5494354Z 2025-05-07T20:31:45.5494434Z @given( 2025-05-07T20:31:45.5494563Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5494667Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5494791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5494916Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5495033Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5495121Z ) 2025-05-07T20:31:45.5495368Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5495465Z def test_silu_mul_quant( 2025-05-07T20:31:45.5495549Z self, 2025-05-07T20:31:45.5495629Z T: int, 2025-05-07T20:31:45.5495709Z D: int, 2025-05-07T20:31:45.5495819Z scale_ub: Optional[float], 2025-05-07T20:31:45.5496024Z contiguous: bool, 2025-05-07T20:31:45.5496111Z compiled: bool, 2025-05-07T20:31:45.5496194Z ) -> None: 2025-05-07T20:31:45.5496292Z torch.manual_seed(2025) 2025-05-07T20:31:45.5496372Z 2025-05-07T20:31:45.5496545Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5496622Z 2025-05-07T20:31:45.5496720Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5496848Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5496944Z x = x_sign * x_clamp 2025-05-07T20:31:45.5497030Z x0 = x[:, :D] 2025-05-07T20:31:45.5497113Z x1 = x[:, D:] 2025-05-07T20:31:45.5497191Z 2025-05-07T20:31:45.5497285Z if contiguous: 2025-05-07T20:31:45.5497379Z x0 = x0.contiguous() 2025-05-07T20:31:45.5497473Z x1 = x1.contiguous() 2025-05-07T20:31:45.5497553Z 2025-05-07T20:31:45.5497646Z if scale_ub is not None: 2025-05-07T20:31:45.5497762Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5497908Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5497986Z ) 2025-05-07T20:31:45.5498069Z else: 2025-05-07T20:31:45.5498166Z scale_ub_tensor = None 2025-05-07T20:31:45.5498243Z 2025-05-07T20:31:45.5498379Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:31:45.5498473Z op = silu_mul_quant 2025-05-07T20:31:45.5498561Z if compiled: 2025-05-07T20:31:45.5498748Z op = torch.compile(op) 2025-05-07T20:31:45.5498858Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5498933Z 2025-05-07T20:31:45.5499032Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.5499156Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.5499232Z 2025-05-07T20:31:45.5499377Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5499482Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.5499596Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.5499723Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.5499866Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5499949Z 2025-05-07T20:31:45.5500057Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.5500061Z 2025-05-07T20:31:45.5500161Z moe/activation_test.py:126: 2025-05-07T20:31:45.5500304Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5500412Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.5500554Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5501115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.5501219Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.5501583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5501814Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5502181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.5502442Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5502845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.5503104Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5503477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.5503646Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.5503992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.5504156Z fn() 2025-05-07T20:31:45.5504559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.5504646Z self.fn.run( 2025-05-07T20:31:45.5504988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5505091Z kernel = self.compile( 2025-05-07T20:31:45.5505478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5505847Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5506042Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5506050Z 2025-05-07T20:31:45.5506320Z self = 2025-05-07T20:31:45.5507385Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5507938Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51cbb0aac0>} 2025-05-07T20:31:45.5508828Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5509028Z context = 2025-05-07T20:31:45.5509032Z 2025-05-07T20:31:45.5509200Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5509472Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5509589Z module_map=module_map) 2025-05-07T20:31:45.5509753Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5509866Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.5509947Z E ^ 2025-05-07T20:31:45.5510309Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5510314Z 2025-05-07T20:31:45.5510737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5510741Z 2025-05-07T20:31:45.5510849Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5511078Z self=, 2025-05-07T20:31:45.5511160Z T=4096, 2025-05-07T20:31:45.5511248Z D=5120, 2025-05-07T20:31:45.5511336Z scale_ub=None, 2025-05-07T20:31:45.5511424Z contiguous=True, 2025-05-07T20:31:45.5511521Z compiled=True, 2025-05-07T20:31:45.5511600Z ) 2025-05-07T20:31:45.5511821Z self = 2025-05-07T20:31:45.5511999Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.5512004Z 2025-05-07T20:31:45.5512088Z @given( 2025-05-07T20:31:45.5512211Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5512315Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5512436Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5512560Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5512676Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5512752Z ) 2025-05-07T20:31:45.5513003Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5513098Z def test_silu_mul_quant( 2025-05-07T20:31:45.5513177Z self, 2025-05-07T20:31:45.5513386Z T: int, 2025-05-07T20:31:45.5513469Z D: int, 2025-05-07T20:31:45.5513568Z scale_ub: Optional[float], 2025-05-07T20:31:45.5513664Z contiguous: bool, 2025-05-07T20:31:45.5513751Z compiled: bool, 2025-05-07T20:31:45.5513833Z ) -> None: 2025-05-07T20:31:45.5513934Z torch.manual_seed(2025) 2025-05-07T20:31:45.5514013Z 2025-05-07T20:31:45.5514186Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5514261Z 2025-05-07T20:31:45.5514362Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5514496Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5514586Z x = x_sign * x_clamp 2025-05-07T20:31:45.5514666Z x0 = x[:, :D] 2025-05-07T20:31:45.5514751Z x1 = x[:, D:] 2025-05-07T20:31:45.5514824Z 2025-05-07T20:31:45.5514910Z if contiguous: 2025-05-07T20:31:45.5515008Z x0 = x0.contiguous() 2025-05-07T20:31:45.5515098Z x1 = x1.contiguous() 2025-05-07T20:31:45.5515181Z 2025-05-07T20:31:45.5515277Z if scale_ub is not None: 2025-05-07T20:31:45.5515384Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5515520Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5515603Z ) 2025-05-07T20:31:45.5515685Z else: 2025-05-07T20:31:45.5515787Z scale_ub_tensor 
= None 2025-05-07T20:31:45.5515860Z 2025-05-07T20:31:45.5515991Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5516168Z op = silu_mul_quant 2025-05-07T20:31:45.5516256Z if compiled: 2025-05-07T20:31:45.5516356Z op = torch.compile(op) 2025-05-07T20:31:45.5516466Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5516543Z 2025-05-07T20:31:45.5516632Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.5516755Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.5516831Z 2025-05-07T20:31:45.5516974Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5517081Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.5517181Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.5517308Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.5517446Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5517523Z 2025-05-07T20:31:45.5517626Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.5517630Z 2025-05-07T20:31:45.5517733Z moe/activation_test.py:126: 2025-05-07T20:31:45.5517862Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5517969Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.5518102Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5518667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.5518774Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.5519132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5519358Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5519724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.5519985Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5520391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.5520646Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5521022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.5521274Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.5521616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.5521701Z fn() 2025-05-07T20:31:45.5522097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.5522183Z self.fn.run( 2025-05-07T20:31:45.5522523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5522617Z kernel = self.compile( 2025-05-07T20:31:45.5522998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5523173Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5523303Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5523312Z 2025-05-07T20:31:45.5523525Z self = 2025-05-07T20:31:45.5524299Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5524973Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c9bea5c0>} 2025-05-07T20:31:45.5525722Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5525916Z context = 2025-05-07T20:31:45.5525921Z 2025-05-07T20:31:45.5526086Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5526357Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5526470Z module_map=module_map) 2025-05-07T20:31:45.5526632Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5526735Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.5526821Z E ^ 2025-05-07T20:31:45.5527184Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5527188Z 2025-05-07T20:31:45.5527728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5527734Z 2025-05-07T20:31:45.5527841Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5528062Z self=, 2025-05-07T20:31:45.5528148Z T=16384, 2025-05-07T20:31:45.5528234Z D=5120, 2025-05-07T20:31:45.5528323Z scale_ub=None, 2025-05-07T20:31:45.5528411Z contiguous=True, 2025-05-07T20:31:45.5528497Z compiled=True, 2025-05-07T20:31:45.5528578Z ) 2025-05-07T20:31:45.5528797Z self = 2025-05-07T20:31:45.5528973Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.5528978Z 2025-05-07T20:31:45.5529059Z @given( 2025-05-07T20:31:45.5529183Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5529285Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5529407Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5529525Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5529644Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5529721Z ) 2025-05-07T20:31:45.5529966Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5530147Z def test_silu_mul_quant( 2025-05-07T20:31:45.5530227Z self, 2025-05-07T20:31:45.5530307Z T: int, 2025-05-07T20:31:45.5530395Z D: int, 2025-05-07T20:31:45.5530495Z scale_ub: Optional[float], 2025-05-07T20:31:45.5530586Z contiguous: bool, 2025-05-07T20:31:45.5530676Z compiled: bool, 2025-05-07T20:31:45.5530756Z ) -> None: 2025-05-07T20:31:45.5530852Z torch.manual_seed(2025) 2025-05-07T20:31:45.5530935Z 2025-05-07T20:31:45.5531113Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5531190Z 2025-05-07T20:31:45.5531284Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5531410Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5531503Z x = x_sign * x_clamp 2025-05-07T20:31:45.5531585Z x0 = x[:, :D] 2025-05-07T20:31:45.5531668Z x1 = x[:, D:] 2025-05-07T20:31:45.5531749Z 2025-05-07T20:31:45.5531835Z if contiguous: 2025-05-07T20:31:45.5531932Z x0 = x0.contiguous() 2025-05-07T20:31:45.5532029Z x1 = x1.contiguous() 2025-05-07T20:31:45.5532105Z 2025-05-07T20:31:45.5532197Z if scale_ub is not None: 2025-05-07T20:31:45.5532307Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5532442Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:31:45.5532525Z ) 2025-05-07T20:31:45.5532608Z else: 2025-05-07T20:31:45.5532704Z scale_ub_tensor = None 2025-05-07T20:31:45.5532864Z 2025-05-07T20:31:45.5532994Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5533089Z op = silu_mul_quant 2025-05-07T20:31:45.5533177Z if compiled: 2025-05-07T20:31:45.5533280Z op = torch.compile(op) 2025-05-07T20:31:45.5533387Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5533467Z 2025-05-07T20:31:45.5533559Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.5533684Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.5533763Z 2025-05-07T20:31:45.5533899Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5534000Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.5534100Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.5534222Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.5534365Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5534440Z 2025-05-07T20:31:45.5534547Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.5534552Z 2025-05-07T20:31:45.5534653Z moe/activation_test.py:126: 2025-05-07T20:31:45.5534783Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5534891Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.5535028Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5535587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.5535696Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.5536053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5536274Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5536650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.5536906Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5537304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.5537557Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5537929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.5538184Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.5538524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.5538604Z fn() 2025-05-07T20:31:45.5539007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.5539099Z self.fn.run( 2025-05-07T20:31:45.5539445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5539543Z kernel = self.compile( 2025-05-07T20:31:45.5539922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5540099Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5540230Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:45.5540235Z 2025-05-07T20:31:45.5540447Z self = 2025-05-07T20:31:45.5541221Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5541799Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51caa054e0>} 2025-05-07T20:31:45.5542555Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5542749Z context = 2025-05-07T20:31:45.5542758Z 2025-05-07T20:31:45.5542927Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5543191Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5543299Z module_map=module_map) 2025-05-07T20:31:45.5543467Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5543571Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.5543652Z E ^ 2025-05-07T20:31:45.5544015Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5544019Z 2025-05-07T20:31:45.5544431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5544436Z 2025-05-07T20:31:45.5544543Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5544764Z self=, 2025-05-07T20:31:45.5544850Z T=1, 2025-05-07T20:31:45.5544937Z D=5120, 2025-05-07T20:31:45.5545021Z scale_ub=1200.0, 2025-05-07T20:31:45.5545110Z contiguous=True, 2025-05-07T20:31:45.5545194Z compiled=True, 2025-05-07T20:31:45.5545272Z ) 2025-05-07T20:31:45.5545502Z self = 2025-05-07T20:31:45.5550916Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:45.5550924Z 2025-05-07T20:31:45.5551027Z @given( 2025-05-07T20:31:45.5551164Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5551272Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5551391Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5551522Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5551640Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5551724Z ) 2025-05-07T20:31:45.5552088Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5552187Z def test_silu_mul_quant( 2025-05-07T20:31:45.5552275Z self, 2025-05-07T20:31:45.5552357Z T: int, 2025-05-07T20:31:45.5552437Z D: int, 2025-05-07T20:31:45.5552545Z scale_ub: Optional[float], 2025-05-07T20:31:45.5552643Z contiguous: bool, 2025-05-07T20:31:45.5552733Z compiled: bool, 2025-05-07T20:31:45.5552822Z ) -> None: 2025-05-07T20:31:45.5552926Z torch.manual_seed(2025) 2025-05-07T20:31:45.5553004Z 2025-05-07T20:31:45.5553189Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5553268Z 2025-05-07T20:31:45.5553371Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5553502Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5553596Z x = x_sign * x_clamp 2025-05-07T20:31:45.5553683Z x0 = x[:, :D] 2025-05-07T20:31:45.5553772Z x1 = x[:, D:] 2025-05-07T20:31:45.5553848Z 2025-05-07T20:31:45.5553939Z if contiguous: 2025-05-07T20:31:45.5554034Z x0 = x0.contiguous() 2025-05-07T20:31:45.5554126Z x1 = x1.contiguous() 2025-05-07T20:31:45.5554208Z 2025-05-07T20:31:45.5554304Z if scale_ub is not None: 2025-05-07T20:31:45.5554416Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:31:45.5554560Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5554640Z ) 2025-05-07T20:31:45.5554807Z else: 2025-05-07T20:31:45.5554909Z scale_ub_tensor = None 2025-05-07T20:31:45.5554986Z 2025-05-07T20:31:45.5555126Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5555220Z op = silu_mul_quant 2025-05-07T20:31:45.5555310Z if compiled: 2025-05-07T20:31:45.5555417Z op = torch.compile(op) 2025-05-07T20:31:45.5555525Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5555607Z 2025-05-07T20:31:45.5555705Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.5555709Z 2025-05-07T20:31:45.5555811Z moe/activation_test.py:117: 2025-05-07T20:31:45.5555947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5556055Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.5556159Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5556549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.5556646Z return fn(*args, **kwargs) 2025-05-07T20:31:45.5557151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.5557255Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.5557619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5557850Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5558199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5558298Z kernel = self.compile( 2025-05-07T20:31:45.5558687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5558869Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5559003Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5559008Z 2025-05-07T20:31:45.5559220Z self = 2025-05-07T20:31:45.5560008Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5560602Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51ca2f09a0>} 2025-05-07T20:31:45.5561358Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5561559Z context = 2025-05-07T20:31:45.5561564Z 2025-05-07T20:31:45.5561732Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5562002Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5562115Z module_map=module_map) 2025-05-07T20:31:45.5562280Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5562382Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.5562470Z E ^ 2025-05-07T20:31:45.5562829Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5562834Z 2025-05-07T20:31:45.5563259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5563264Z 2025-05-07T20:31:45.5563370Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5563699Z self=, 2025-05-07T20:31:45.5563784Z T=1, 2025-05-07T20:31:45.5563866Z D=5120, 2025-05-07T20:31:45.5563954Z scale_ub=None, 2025-05-07T20:31:45.5564052Z contiguous=False, 2025-05-07T20:31:45.5564140Z compiled=True, 2025-05-07T20:31:45.5564220Z ) 2025-05-07T20:31:45.5564443Z self = 2025-05-07T20:31:45.5564612Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.5564624Z 2025-05-07T20:31:45.5564708Z @given( 2025-05-07T20:31:45.5564831Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5564937Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5565059Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5565180Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5565295Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5565380Z ) 2025-05-07T20:31:45.5565638Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5565739Z def test_silu_mul_quant( 2025-05-07T20:31:45.5565820Z self, 2025-05-07T20:31:45.5565900Z T: int, 2025-05-07T20:31:45.5565985Z D: int, 2025-05-07T20:31:45.5566089Z scale_ub: Optional[float], 2025-05-07T20:31:45.5566181Z contiguous: bool, 2025-05-07T20:31:45.5566275Z compiled: bool, 2025-05-07T20:31:45.5566364Z ) -> None: 2025-05-07T20:31:45.5566466Z torch.manual_seed(2025) 2025-05-07T20:31:45.5566545Z 2025-05-07T20:31:45.5566716Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5566793Z 2025-05-07T20:31:45.5566897Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5567025Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5567122Z x = x_sign * x_clamp 2025-05-07T20:31:45.5567205Z x0 = x[:, :D] 2025-05-07T20:31:45.5567294Z x1 = x[:, D:] 2025-05-07T20:31:45.5567378Z 2025-05-07T20:31:45.5567468Z if contiguous: 2025-05-07T20:31:45.5567631Z x0 = x0.contiguous() 2025-05-07T20:31:45.5567727Z x1 = x1.contiguous() 2025-05-07T20:31:45.5567804Z 2025-05-07T20:31:45.5567898Z if scale_ub is not None: 2025-05-07T20:31:45.5568015Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5568154Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5568322Z ) 2025-05-07T20:31:45.5568408Z else: 2025-05-07T20:31:45.5568508Z scale_ub_tensor = None 2025-05-07T20:31:45.5568591Z 2025-05-07T20:31:45.5568726Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5568819Z op = silu_mul_quant 2025-05-07T20:31:45.5568915Z if compiled: 2025-05-07T20:31:45.5569019Z op = torch.compile(op) 2025-05-07T20:31:45.5569128Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5569213Z 2025-05-07T20:31:45.5569313Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.5569438Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.5569516Z 2025-05-07T20:31:45.5569654Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5569759Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.5569867Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.5569994Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.5570149Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5570226Z 2025-05-07T20:31:45.5570332Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:45.5570337Z 2025-05-07T20:31:45.5570446Z moe/activation_test.py:126: 2025-05-07T20:31:45.5570583Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5570694Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.5570916Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5571533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.5571645Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.5572007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5572233Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5572613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.5572874Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5573274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.5573540Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5573919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.5574093Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.5574437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.5574518Z fn() 2025-05-07T20:31:45.5574932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.5575026Z self.fn.run( 2025-05-07T20:31:45.5575367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5575464Z kernel = self.compile( 2025-05-07T20:31:45.5575847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5576032Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5576164Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5576168Z 2025-05-07T20:31:45.5576375Z self = 2025-05-07T20:31:45.5577157Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5577742Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f51c97751c0>} 2025-05-07T20:31:45.5578502Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5578696Z context = 2025-05-07T20:31:45.5578700Z 2025-05-07T20:31:45.5578876Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5579142Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5579252Z module_map=module_map) 2025-05-07T20:31:45.5579424Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5579531Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.5579611Z E ^ 2025-05-07T20:31:45.5579972Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5579977Z 2025-05-07T20:31:45.5580393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5580397Z 2025-05-07T20:31:45.5580588Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5580815Z self=, 2025-05-07T20:31:45.5580895Z T=1, 2025-05-07T20:31:45.5580994Z D=5120, 2025-05-07T20:31:45.5581091Z scale_ub=None, 2025-05-07T20:31:45.5581194Z contiguous=True, 2025-05-07T20:31:45.5581295Z compiled=False, 2025-05-07T20:31:45.5581375Z ) 2025-05-07T20:31:45.5581598Z self = 2025-05-07T20:31:45.5581771Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.5581776Z 2025-05-07T20:31:45.5581858Z @given( 2025-05-07T20:31:45.5581986Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5582088Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5582210Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5582331Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5582450Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5582531Z ) 2025-05-07T20:31:45.5582779Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5582875Z def test_silu_mul_quant( 2025-05-07T20:31:45.5582961Z self, 2025-05-07T20:31:45.5583043Z T: int, 2025-05-07T20:31:45.5583122Z D: int, 2025-05-07T20:31:45.5583229Z scale_ub: Optional[float], 2025-05-07T20:31:45.5583326Z contiguous: bool, 2025-05-07T20:31:45.5583416Z compiled: bool, 2025-05-07T20:31:45.5583502Z ) -> None: 2025-05-07T20:31:45.5583598Z torch.manual_seed(2025) 2025-05-07T20:31:45.5583679Z 2025-05-07T20:31:45.5583853Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5583929Z 2025-05-07T20:31:45.5584030Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5584158Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5584255Z x = x_sign * x_clamp 2025-05-07T20:31:45.5584344Z x0 = x[:, :D] 2025-05-07T20:31:45.5584429Z x1 = x[:, D:] 2025-05-07T20:31:45.5584504Z 2025-05-07T20:31:45.5584594Z if contiguous: 2025-05-07T20:31:45.5584688Z x0 = x0.contiguous() 2025-05-07T20:31:45.5584782Z x1 = x1.contiguous() 2025-05-07T20:31:45.5584863Z 2025-05-07T20:31:45.5584959Z if scale_ub is not None: 2025-05-07T20:31:45.5585071Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5585301Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5585381Z ) 2025-05-07T20:31:45.5585463Z else: 2025-05-07T20:31:45.5585564Z scale_ub_tensor = None 2025-05-07T20:31:45.5585639Z 2025-05-07T20:31:45.5585774Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5585868Z op = silu_mul_quant 2025-05-07T20:31:45.5585957Z if compiled: 2025-05-07T20:31:45.5586068Z 
op = torch.compile(op) 2025-05-07T20:31:45.5586179Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5586256Z 2025-05-07T20:31:45.5586352Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.5586357Z 2025-05-07T20:31:45.5586459Z moe/activation_test.py:117: 2025-05-07T20:31:45.5586596Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5586701Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.5586816Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5587325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.5587426Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.5587788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5588016Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5588439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5588542Z kernel = self.compile( 2025-05-07T20:31:45.5588925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5589104Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5589236Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5589246Z 2025-05-07T20:31:45.5589453Z self = 2025-05-07T20:31:45.5590236Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5590744Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c9a50a40>} 2025-05-07T20:31:45.5591523Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5591744Z context = 2025-05-07T20:31:45.5591754Z 2025-05-07T20:31:45.5591922Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5592191Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5592301Z module_map=module_map) 2025-05-07T20:31:45.5592468Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5592575Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.5592656Z E ^ 2025-05-07T20:31:45.5593019Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5593024Z 2025-05-07T20:31:45.5593439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5593444Z 2025-05-07T20:31:45.5593552Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5593783Z self=, 2025-05-07T20:31:45.5593968Z T=128, 2025-05-07T20:31:45.5594051Z D=5120, 2025-05-07T20:31:45.5594142Z scale_ub=None, 2025-05-07T20:31:45.5594234Z contiguous=False, 2025-05-07T20:31:45.5594326Z compiled=True, 2025-05-07T20:31:45.5594405Z ) 2025-05-07T20:31:45.5594626Z self = 2025-05-07T20:31:45.5594803Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.5594808Z 2025-05-07T20:31:45.5594895Z @given( 2025-05-07T20:31:45.5595017Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5595124Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5595245Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5595367Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5595489Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5595568Z ) 2025-05-07T20:31:45.5595819Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5595920Z def test_silu_mul_quant( 2025-05-07T20:31:45.5596000Z self, 2025-05-07T20:31:45.5596088Z T: int, 2025-05-07T20:31:45.5596167Z D: int, 2025-05-07T20:31:45.5596272Z scale_ub: Optional[float], 2025-05-07T20:31:45.5596373Z contiguous: bool, 2025-05-07T20:31:45.5596461Z compiled: bool, 2025-05-07T20:31:45.5596542Z ) -> None: 2025-05-07T20:31:45.5596645Z torch.manual_seed(2025) 2025-05-07T20:31:45.5596802Z 2025-05-07T20:31:45.5596979Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5597062Z 2025-05-07T20:31:45.5597157Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5597289Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5597380Z x = x_sign * x_clamp 2025-05-07T20:31:45.5597463Z x0 = x[:, :D] 2025-05-07T20:31:45.5597551Z x1 = x[:, D:] 2025-05-07T20:31:45.5597633Z 2025-05-07T20:31:45.5597720Z if contiguous: 2025-05-07T20:31:45.5597819Z x0 = x0.contiguous() 2025-05-07T20:31:45.5597911Z x1 = x1.contiguous() 2025-05-07T20:31:45.5597987Z 2025-05-07T20:31:45.5598084Z if scale_ub is not None: 2025-05-07T20:31:45.5598195Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5598331Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5598414Z ) 2025-05-07T20:31:45.5598499Z else: 2025-05-07T20:31:45.5598598Z scale_ub_tensor = None 2025-05-07T20:31:45.5598682Z 2025-05-07T20:31:45.5598814Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5598910Z op = silu_mul_quant 2025-05-07T20:31:45.5598998Z if compiled: 2025-05-07T20:31:45.5599101Z op = torch.compile(op) 2025-05-07T20:31:45.5599215Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5599293Z 2025-05-07T20:31:45.5599390Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.5599395Z 2025-05-07T20:31:45.5599499Z moe/activation_test.py:117: 2025-05-07T20:31:45.5599632Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5599735Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.5599841Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5600211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.5600314Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.5600813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.5600913Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.5601275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5601498Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5601927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5602027Z kernel = self.compile( 2025-05-07T20:31:45.5602409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5602588Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5602723Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5602727Z 2025-05-07T20:31:45.5602934Z self = 2025-05-07T20:31:45.5603714Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5604223Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c9a52c00>} 2025-05-07T20:31:45.5604975Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5605173Z context = 2025-05-07T20:31:45.5605252Z 2025-05-07T20:31:45.5605427Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5605958Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5606114Z module_map=module_map) 2025-05-07T20:31:45.5606286Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5606386Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.5606471Z E ^ 2025-05-07T20:31:45.5606829Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.5606834Z 
2025-05-07T20:31:45.5607247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:45.5607251Z 
2025-05-07T20:31:45.5607358Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:45.5607639Z     self=,
2025-05-07T20:31:45.5607722Z     T=128,
2025-05-07T20:31:45.5607804Z     D=7168,
2025-05-07T20:31:45.5607891Z     scale_ub=1200.0,
2025-05-07T20:31:45.5607979Z     contiguous=False,
2025-05-07T20:31:45.5608066Z     compiled=False,
2025-05-07T20:31:45.5608141Z )
2025-05-07T20:31:45.5608362Z self = 
2025-05-07T20:31:45.5608540Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:31:45.5608549Z 
2025-05-07T20:31:45.5608627Z     @given(
2025-05-07T20:31:45.5608750Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:45.5608850Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:45.5608963Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:45.5609085Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:45.5609197Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:45.5609271Z     )
2025-05-07T20:31:45.5609528Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:45.5609623Z     def test_silu_mul_quant(
2025-05-07T20:31:45.5609708Z         self,
2025-05-07T20:31:45.5609786Z         T: int,
2025-05-07T20:31:45.5609864Z         D: int,
2025-05-07T20:31:45.5609966Z         scale_ub: Optional[float],
2025-05-07T20:31:45.5610055Z         contiguous: bool,
2025-05-07T20:31:45.5610142Z         compiled: bool,
2025-05-07T20:31:45.5610222Z     ) -> None:
2025-05-07T20:31:45.5610481Z         torch.manual_seed(2025)
2025-05-07T20:31:45.5610558Z 
2025-05-07T20:31:45.5610729Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:45.5610805Z 
2025-05-07T20:31:45.5610902Z         x_sign = torch.sign(x)
2025-05-07T20:31:45.5611024Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:45.5611114Z         x = x_sign * x_clamp
2025-05-07T20:31:45.5611202Z         x0 = x[:, :D]
2025-05-07T20:31:45.5611282Z         x1 = x[:, D:]
2025-05-07T20:31:45.5611360Z 
2025-05-07T20:31:45.5611450Z         if contiguous:
2025-05-07T20:31:45.5611544Z             x0 = x0.contiguous()
2025-05-07T20:31:45.5611635Z             x1 = x1.contiguous()
2025-05-07T20:31:45.5611712Z 
2025-05-07T20:31:45.5611804Z         if scale_ub is not None:
2025-05-07T20:31:45.5611913Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:45.5612049Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:45.5612133Z             )
2025-05-07T20:31:45.5612213Z         else:
2025-05-07T20:31:45.5612308Z             scale_ub_tensor = None
2025-05-07T20:31:45.5612383Z 
2025-05-07T20:31:45.5612523Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:45.5612612Z             op = silu_mul_quant
2025-05-07T20:31:45.5612699Z             if compiled:
2025-05-07T20:31:45.5612804Z                 op = torch.compile(op)
2025-05-07T20:31:45.5612911Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:45.5613095Z 
2025-05-07T20:31:45.5613192Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:45.5613196Z 
2025-05-07T20:31:45.5613292Z moe/activation_test.py:117: 
2025-05-07T20:31:45.5613423Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:45.5613525Z moe/activation_test.py:115: in fn
2025-05-07T20:31:45.5613626Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:45.5614135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:45.5614245Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:45.5614602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:45.5614826Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:45.5615166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:45.5615270Z     kernel = self.compile(
2025-05-07T20:31:45.5615652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:45.5615826Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:45.5615956Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:45.5615960Z 
2025-05-07T20:31:45.5616162Z self = 
2025-05-07T20:31:45.5616948Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:45.5617449Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c971f380>}
2025-05-07T20:31:45.5618204Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:45.5618396Z context = 
2025-05-07T20:31:45.5618401Z 
2025-05-07T20:31:45.5618565Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:45.5618918Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:45.5619024Z                            module_map=module_map)
2025-05-07T20:31:45.5619186Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.5619289Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.5619365Z E       ^
2025-05-07T20:31:45.5619722Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.5619732Z 
2025-05-07T20:31:45.5620145Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
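Root-cause note: every failing example in this run trips the same Triton check. fp8e4nv corresponds to the CUDA e4m3 format, which Triton only emits for GPUs of compute capability 8.9 or newer (Ada/Hopper); the linux.g5.4xlarge runner carries an A10G at SM 8.6, where only fp8e4b15 and fp8e5 are available, hence the ValueError. A minimal skip-guard sketch under those assumptions; the helper and class names below are illustrative, not FBGEMM's actual test code:

    # Hypothetical guard: skip fp8e4nv tests on pre-SM89 GPUs such as the A10G.
    import unittest

    import torch


    def _supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (e4m3) codegen needs SM 8.9+ (Ada) or SM 9.0 (Hopper).
        # torch.cuda.get_device_capability() returns (major, minor), e.g. (8, 6) on A10G.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
    class Fp8GuardedTests(unittest.TestCase):
        pass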
2025-05-07T20:31:45.5620149Z 
2025-05-07T20:31:45.5620254Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False); same test body, failing at the same _fbgemm_silu_mul_quant[grid] launch:
2025-05-07T20:31:45.5631899Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.5631998Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.5632079Z E       ^
2025-05-07T20:31:45.5632434Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.5632438Z 
2025-05-07T20:31:45.5632853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
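The @given strategies shown in the first example draw every parameter from a small fixed set, so each "Trying example" above is one point of an 80-combination grid. A plain-Python sketch of that space (assuming an exhaustive product rather than hypothesis's own sampling order):

    # Sketch of the parameter space the st.sampled_from strategies cover.
    from itertools import product

    GRID = list(product(
        [1, 128, 2048, 4096, 16384],   # T
        [5120, 7168],                  # D
        [None, 1200.0],                # scale_ub
        [True, False],                 # contiguous
        [True, False],                 # compiled
    ))
    assert len(GRID) == 80  # 5 * 2 * 2 * 2 * 2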
2025-05-07T20:31:45.5632866Z 
2025-05-07T20:31:45.5632968Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False); same test body and traceback, same error:
2025-05-07T20:31:45.5644606Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.5644704Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.5644787Z E       ^
2025-05-07T20:31:45.5645142Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.5645147Z 
2025-05-07T20:31:45.5645559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:45.5645563Z 
2025-05-07T20:31:45.5645666Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True); same test body, now routed through torch/_dynamo/eval_frame.py:678 before reaching the same _fbgemm_silu_mul_quant[grid] launch:
2025-05-07T20:31:45.5657756Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.5657854Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.5657940Z E       ^
2025-05-07T20:31:45.5658291Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.5658295Z 
2025-05-07T20:31:45.5658707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
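The compiled=True examples differ from the eager ones only by that torch/_dynamo/eval_frame.py:678 frame: torch.compile wraps fn, but the underlying Triton kernel still has to compile, so both variants die at the same ast_to_ttir check. A minimal sketch of that wrapping (illustrative only, not the test code):

    import torch


    def silu(x: torch.Tensor) -> torch.Tensor:
        # Same activation the test exercises: x * sigmoid(x).
        return x * torch.sigmoid(x)


    # The eager call goes straight to the function; the compiled call is
    # routed through torch._dynamo's eval_frame hook first, as in the traces.
    eager_y = silu(torch.randn(4))
    compiled_silu = torch.compile(silu)
    compiled_y = compiled_silu(torch.randn(4))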
2025-05-07T20:31:45.5658712Z 
2025-05-07T20:31:45.5658819Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True); same test body and compiled-path traceback, same error:
2025-05-07T20:31:45.5670984Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.5671086Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.5671167Z E       ^
2025-05-07T20:31:45.5671518Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.5671522Z 
2025-05-07T20:31:45.5671937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:45.5671942Z 
2025-05-07T20:31:45.5672045Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True); same test body, but here fn() succeeds and the failure moves to the reference path:
2025-05-07T20:31:45.5682906Z         y_fp8, y_scale = fn()
2025-05-07T20:31:45.5683029Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:45.5683105Z 
2025-05-07T20:31:45.5683248Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:45.5683352Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:45.5683457Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:45.5683590Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:45.5683732Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:45.5683809Z 
2025-05-07T20:31:45.5683916Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:45.5683921Z 
2025-05-07T20:31:45.5684023Z moe/activation_test.py:126: 
2025-05-07T20:31:45.5684161Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:45.5684275Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:45.5684410Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:45.5684980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:45.5685086Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:45.5685453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:45.5685684Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:45.5686054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:45.5686314Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:45.5687351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:45.5687606Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:45.5687952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:45.5688119Z     fn()
2025-05-07T20:31:45.5688521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:45.5688614Z     self.fn.run(
2025-05-07T20:31:45.5688954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:45.5689052Z     kernel = self.compile(
2025-05-07T20:31:45.5689438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:45.5689616Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:45.5689755Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:45.5693030Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.5693138Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:45.5693218Z E       ^
2025-05-07T20:31:45.5693575Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.5693580Z 
2025-05-07T20:31:45.5694001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:45.5694012Z 
2025-05-07T20:31:45.5694119Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True); same test body and compiled-path traceback, same error:
2025-05-07T20:31:45.5706771Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.5706873Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.5706956Z E       ^
2025-05-07T20:31:45.5707317Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.5707322Z 
2025-05-07T20:31:45.5707739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
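The T=1, D=7168, scale_ub=None, compiled=True example above is the one variant that got past fn(): its failure moved to the reference path, where triton_quantize_fp8_row autotunes _kernel_quantize_fp8_row and hits the identical fp8e4nv check inside the autotuner's benchmark loop. The reference math itself is plain fp32; a sketch of it taken straight from the test body, minus the fp8 cast that cannot compile on this GPU:

    import torch


    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # SiLU(x0) * x1 in fp32, exactly as ref_fn computes before quantizing.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32


    def dequant_rowwise(y_fp8: torch.Tensor, y_scale: torch.Tensor) -> torch.Tensor:
        # Row-wise dequantization as done in the test: one scale per row.
        return y_fp8.to(torch.float32) * y_scale[:, None]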
2025-05-07T20:31:45.5707750Z 
2025-05-07T20:31:45.5707862Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False); same test body and traceback, same error:
2025-05-07T20:31:45.5719826Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.5719938Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.5720021Z E       ^
2025-05-07T20:31:45.5720380Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.5720385Z 
2025-05-07T20:31:45.5720807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:45.5720811Z 
2025-05-07T20:31:45.5720920Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True); same test body and compiled-path traceback, same error:
2025-05-07T20:31:45.5733450Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.5733629Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.5733714Z E       ^
2025-05-07T20:31:45.5734076Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.5734081Z 
2025-05-07T20:31:45.5734500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
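For isolating the failure outside the FBGEMM test suite, a hypothetical, self-contained repro sketch (not FBGEMM code; assumes Triton's tl.float8e4nv alias and torch.float8_e4m3fn are available): any Triton kernel that casts to fp8e4nv should raise the same CompilationError at ast_to_ttir time on an SM 8.6 device.

    # Repro sketch: a trivial fp8e4nv cast kernel, expected to fail on pre-SM89 GPUs.
    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _cast_fp8e4nv(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < N
        x = tl.load(x_ptr + offs, mask=mask)
        # The cast below is what trips the architecture check during ast_to_ttir.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


    x = torch.randn(1024, device="cuda", dtype=torch.float32)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_fp8e4nv[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)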
2025-05-07T20:31:45.5734504Z 
2025-05-07T20:31:45.5734617Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True); same test body and compiled-path traceback, same error:
2025-05-07T20:31:45.5747280Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.5747467Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.5747550Z E       ^
2025-05-07T20:31:45.5747906Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.5747911Z 
2025-05-07T20:31:45.5748326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5747911Z 2025-05-07T20:31:45.5748326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5748330Z 2025-05-07T20:31:45.5748441Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5748664Z self=, 2025-05-07T20:31:45.5748746Z T=1, 2025-05-07T20:31:45.5748825Z D=5120, 2025-05-07T20:31:45.5748915Z scale_ub=None, 2025-05-07T20:31:45.5749004Z contiguous=False, 2025-05-07T20:31:45.5749091Z compiled=False, 2025-05-07T20:31:45.5749173Z ) 2025-05-07T20:31:45.5749390Z self = 2025-05-07T20:31:45.5749564Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.5749569Z 2025-05-07T20:31:45.5749653Z @given( 2025-05-07T20:31:45.5749777Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5749876Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5749997Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5750114Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5750311Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5750391Z ) 2025-05-07T20:31:45.5750637Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5750736Z def test_silu_mul_quant( 2025-05-07T20:31:45.5750816Z self, 2025-05-07T20:31:45.5750895Z T: int, 2025-05-07T20:31:45.5750976Z D: int, 2025-05-07T20:31:45.5751076Z scale_ub: Optional[float], 2025-05-07T20:31:45.5751168Z contiguous: bool, 2025-05-07T20:31:45.5751263Z compiled: bool, 2025-05-07T20:31:45.5751346Z ) -> None: 2025-05-07T20:31:45.5751442Z torch.manual_seed(2025) 2025-05-07T20:31:45.5751524Z 2025-05-07T20:31:45.5751694Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5751772Z 2025-05-07T20:31:45.5751870Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5751995Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5752092Z x = x_sign * x_clamp 2025-05-07T20:31:45.5752173Z x0 = x[:, :D] 2025-05-07T20:31:45.5752254Z x1 = x[:, D:] 2025-05-07T20:31:45.5752332Z 2025-05-07T20:31:45.5752417Z if contiguous: 2025-05-07T20:31:45.5752508Z x0 = x0.contiguous() 2025-05-07T20:31:45.5752602Z x1 = x1.contiguous() 2025-05-07T20:31:45.5752673Z 2025-05-07T20:31:45.5752764Z if scale_ub is not None: 2025-05-07T20:31:45.5752873Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5753014Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5753093Z ) 2025-05-07T20:31:45.5753175Z else: 2025-05-07T20:31:45.5753269Z scale_ub_tensor = None 2025-05-07T20:31:45.5753347Z 2025-05-07T20:31:45.5753476Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5753568Z op = silu_mul_quant 2025-05-07T20:31:45.5753658Z if compiled: 2025-05-07T20:31:45.5753763Z op = torch.compile(op) 2025-05-07T20:31:45.5753871Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5753951Z 2025-05-07T20:31:45.5754042Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.5754046Z 2025-05-07T20:31:45.5754143Z moe/activation_test.py:117: 2025-05-07T20:31:45.5754278Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5754382Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.5754483Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5755089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.5755190Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.5755551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5755774Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5756119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5756220Z kernel = self.compile( 2025-05-07T20:31:45.5756602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5756779Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5756908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5756918Z 2025-05-07T20:31:45.5757124Z self = 2025-05-07T20:31:45.5757900Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5758478Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c8e485e0>} 2025-05-07T20:31:45.5759230Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5759423Z context = 2025-05-07T20:31:45.5759428Z 2025-05-07T20:31:45.5759605Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5759870Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5759978Z module_map=module_map) 2025-05-07T20:31:45.5760146Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5760248Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.5760330Z E ^ 2025-05-07T20:31:45.5760691Z E ValueError("type fp8e4nv not supported in this architecture. 
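This is the root cause of the whole cascade: Triton can only lower the fp8e4nv type (PyTorch's torch.float8_e4m3fn) on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper), and the error text above shows this device exposes only fp8e4b15 and fp8e5, i.e. it predates sm_89. A minimal probe for this precondition might look as follows (a sketch; supports_fp8e4nv is an illustrative name, not part of the test suite):

    # Sketch: probe whether the current GPU can compile fp8e4nv (e4m3)
    # Triton kernels.  Pre-sm_89 parts raise exactly the ValueError seen
    # in the traceback above.
    import torch

    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        # get_device_capability() returns e.g. (8, 6); fp8e4nv needs >= (8, 9).
        return torch.cuda.get_device_capability() >= (8, 9)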
[... the remaining Hypothesis examples all fail at the same point; their repeated test-source listings and Triton compile stacks are identical to the first failure above and are condensed to one line each below ...]

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)  -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)  -> same CompilationError
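Since every drawn example dies in kernel compilation, one way to keep this job green on such GPUs would be to gate the test class on the same capability probe, e.g. with unittest.skipIf (a sketch under the assumption that the suite uses unittest-style test classes; the class and helper names here are hypothetical):

    import unittest

    import torch

    def _cuda_cc():
        # (0, 0) when CUDA is unavailable keeps the comparison below valid.
        return torch.cuda.get_device_capability() if torch.cuda.is_available() else (0, 0)

    @unittest.skipIf(_cuda_cc() < (8, 9), "fp8e4nv (float8_e4m3fn) Triton kernels require sm_89+")
    class SiluMulQuantTest(unittest.TestCase):  # hypothetical class name
        ...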
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)  -> same CompilationError
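Note that the compiled=True draws fail identically to the eager ones: in the raw tracebacks they merely add a torch/_dynamo/eval_frame.py:678 frame before reaching activation.py:80. That is expected, since torch.compile wraps the Python callable while the Triton kernel inside is still JIT-compiled for the current GPU at first call. A sketch of the shared call path (run_op is an illustrative helper, not from the test):

    import torch

    def run_op(op, *args, compiled: bool):
        # torch.compile only wraps the Python callable; the Triton kernel
        # launched inside op is still JIT-compiled for the *current* GPU on
        # first call, so both branches hit the same fp8e4nv lowering error
        # on pre-sm_89 hardware.
        if compiled:
            op = torch.compile(op)
        return op(*args)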
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)  -> same CompilationError
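For orientation, the op under test takes (x0, x1, scale_ub_tensor) and returns (y_fp8, y_scale), which suggests a fused SwiGLU-style activation followed by rowwise FP8 quantization. A plain-PyTorch sketch of the assumed semantics follows; the actual FBGEMM kernel's conventions (scale orientation, clamping, rounding) may differ:

    import torch
    import torch.nn.functional as F

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        # Assumed semantics: silu(x0) * x1 in fp32, then rowwise FP8 e4m3
        # quantization with an optional upper bound on the per-row max.
        y = F.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale.squeeze(-1)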
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)  -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> same CompilationError
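Finally, because Hypothesis draws these examples nondeterministically, pinning one failing draw with @example makes the failure reproducible when debugging on fp8-capable hardware (a sketch; test_silu_mul_quant_repro is a hypothetical standalone reduction of the test above):

    from hypothesis import example, given, settings, strategies as st

    # The pinned @example case is always tried first, so the failure
    # reproduces on every run regardless of Hypothesis' random draws.
    @example(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(deadline=None)
    def test_silu_mul_quant_repro(T, D, scale_ub, contiguous, compiled):
        ...  # same body as the test above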
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5872158Z 2025-05-07T20:31:45.5872574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5872583Z 2025-05-07T20:31:45.5872688Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5872911Z self=, 2025-05-07T20:31:45.5873000Z T=1, 2025-05-07T20:31:45.5873080Z D=7168, 2025-05-07T20:31:45.5873166Z scale_ub=None, 2025-05-07T20:31:45.5873261Z contiguous=False, 2025-05-07T20:31:45.5873348Z compiled=False, 2025-05-07T20:31:45.5873426Z ) 2025-05-07T20:31:45.5873647Z self = 2025-05-07T20:31:45.5873816Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.5873820Z 2025-05-07T20:31:45.5873905Z @given( 2025-05-07T20:31:45.5874029Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5874134Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5874256Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5874374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5874489Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5874569Z ) 2025-05-07T20:31:45.5874820Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5874917Z def test_silu_mul_quant( 2025-05-07T20:31:45.5875006Z self, 2025-05-07T20:31:45.5875085Z T: int, 2025-05-07T20:31:45.5875166Z D: int, 2025-05-07T20:31:45.5875270Z scale_ub: Optional[float], 2025-05-07T20:31:45.5875362Z contiguous: bool, 2025-05-07T20:31:45.5875450Z compiled: bool, 2025-05-07T20:31:45.5875537Z ) -> None: 2025-05-07T20:31:45.5875635Z torch.manual_seed(2025) 2025-05-07T20:31:45.5875800Z 2025-05-07T20:31:45.5875973Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5876048Z 2025-05-07T20:31:45.5876147Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5876273Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5876364Z x = x_sign * x_clamp 2025-05-07T20:31:45.5876450Z x0 = x[:, :D] 2025-05-07T20:31:45.5876534Z x1 = x[:, D:] 2025-05-07T20:31:45.5876610Z 2025-05-07T20:31:45.5876709Z if contiguous: 2025-05-07T20:31:45.5876804Z x0 = x0.contiguous() 2025-05-07T20:31:45.5876896Z x1 = x1.contiguous() 2025-05-07T20:31:45.5876976Z 2025-05-07T20:31:45.5877069Z if scale_ub is not None: 2025-05-07T20:31:45.5877182Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5877319Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5877397Z ) 2025-05-07T20:31:45.5877486Z else: 2025-05-07T20:31:45.5877592Z scale_ub_tensor = None 2025-05-07T20:31:45.5877668Z 2025-05-07T20:31:45.5877803Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5877896Z op = silu_mul_quant 2025-05-07T20:31:45.5877982Z if compiled: 2025-05-07T20:31:45.5878087Z op = torch.compile(op) 2025-05-07T20:31:45.5878194Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5878271Z 2025-05-07T20:31:45.5878364Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.5878449Z 2025-05-07T20:31:45.5878550Z moe/activation_test.py:117: 2025-05-07T20:31:45.5878685Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5878786Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.5878887Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5879392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.5879496Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.5879855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5880083Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5880425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5880525Z kernel = self.compile( 2025-05-07T20:31:45.5880913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5881090Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5881223Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5881228Z 2025-05-07T20:31:45.5881433Z self = 2025-05-07T20:31:45.5882269Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5882775Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bbdf60c0>} 2025-05-07T20:31:45.5883531Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5883727Z context = 2025-05-07T20:31:45.5883732Z 2025-05-07T20:31:45.5883898Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5884169Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5884383Z module_map=module_map) 2025-05-07T20:31:45.5884547Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5884652Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.5884733Z E ^ 2025-05-07T20:31:45.5885091Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5885096Z 2025-05-07T20:31:45.5885517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5885521Z 2025-05-07T20:31:45.5885627Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5885853Z self=, 2025-05-07T20:31:45.5885936Z T=2048, 2025-05-07T20:31:45.5886017Z D=7168, 2025-05-07T20:31:45.5886105Z scale_ub=None, 2025-05-07T20:31:45.5886194Z contiguous=False, 2025-05-07T20:31:45.5886289Z compiled=True, 2025-05-07T20:31:45.5886367Z ) 2025-05-07T20:31:45.5886586Z self = 2025-05-07T20:31:45.5886765Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.5886770Z 2025-05-07T20:31:45.5886851Z @given( 2025-05-07T20:31:45.5886975Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5887081Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5887274Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5887395Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5887582Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5887661Z ) 2025-05-07T20:31:45.5887914Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5888012Z def test_silu_mul_quant( 2025-05-07T20:31:45.5888092Z self, 2025-05-07T20:31:45.5888181Z T: int, 2025-05-07T20:31:45.5888260Z D: int, 2025-05-07T20:31:45.5888361Z scale_ub: Optional[float], 2025-05-07T20:31:45.5888456Z contiguous: bool, 2025-05-07T20:31:45.5888544Z compiled: bool, 2025-05-07T20:31:45.5888625Z ) -> None: 2025-05-07T20:31:45.5888727Z torch.manual_seed(2025) 2025-05-07T20:31:45.5888805Z 2025-05-07T20:31:45.5888976Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5889055Z 2025-05-07T20:31:45.5889154Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5889287Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5889380Z x = x_sign * x_clamp 2025-05-07T20:31:45.5889463Z x0 = x[:, :D] 2025-05-07T20:31:45.5889551Z x1 = x[:, D:] 2025-05-07T20:31:45.5889627Z 2025-05-07T20:31:45.5889714Z if contiguous: 2025-05-07T20:31:45.5889811Z x0 = x0.contiguous() 2025-05-07T20:31:45.5889903Z x1 = x1.contiguous() 2025-05-07T20:31:45.5889983Z 2025-05-07T20:31:45.5890082Z if scale_ub is not None: 2025-05-07T20:31:45.5890190Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5890328Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5890413Z ) 2025-05-07T20:31:45.5890492Z else: 2025-05-07T20:31:45.5890588Z scale_ub_tensor = None 2025-05-07T20:31:45.5890670Z 2025-05-07T20:31:45.5890800Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5890903Z op = silu_mul_quant 2025-05-07T20:31:45.5890991Z if compiled: 2025-05-07T20:31:45.5891092Z op = torch.compile(op) 2025-05-07T20:31:45.5891204Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5891280Z 2025-05-07T20:31:45.5891372Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.5891376Z 2025-05-07T20:31:45.5891483Z moe/activation_test.py:117: 2025-05-07T20:31:45.5891615Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5891832Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.5891957Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5892327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.5892424Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.5892925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.5893025Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.5893389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5893616Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5893964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5894065Z kernel = self.compile( 2025-05-07T20:31:45.5894447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5894626Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5894757Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5894762Z 2025-05-07T20:31:45.5894970Z self = 2025-05-07T20:31:45.5895828Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5896335Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bbdf7560>} 2025-05-07T20:31:45.5897096Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5897287Z context = 2025-05-07T20:31:45.5897292Z 2025-05-07T20:31:45.5897461Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5897729Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5897840Z module_map=module_map) 2025-05-07T20:31:45.5898007Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5898109Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.5898190Z E ^ 2025-05-07T20:31:45.5898548Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5898972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
[log condensed: Hypothesis retried the same test with (T, D, scale_ub, contiguous, compiled) = (4096, 7168, None, False, True), (16384, 5120, 1200.0, False, False), (16384, 5120, 1200.0, True, True), (16384, 5120, None, False, True), (2048, 5120, None, False, True), and (2048, 5120, 1200.0, False, True); each example printed the identical test source, traceback, and CompilationError shown above.]
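Every failure above has the same root cause: the Triton kernel behind silu_mul_quant requests the fp8e4nv (FP8 E4M3) element type, which Triton lowers only on NVIDIA GPUs with compute capability 8.9 or newer, while the A10G on this linux.g5.4xlarge runner is SM 8.6 — hence only 'fp8e4b15' and 'fp8e5' are reported as supported. A minimal sketch of a capability guard that would skip these cases instead of erroring (supports_fp8e4nv is a hypothetical helper, not part of the test file):

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) conversions are lowered by Triton only on
    # NVIDIA parts with compute capability >= 8.9 (Ada, Hopper, and newer).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical placement on the failing test:
#
# @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
# def test_silu_mul_quant(self, ...) -> None: ...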
[log condensed: the identical failure repeated for (T, D, scale_ub, contiguous, compiled) = (4096, 5120, 1200.0, True, True), (128, 5120, 1200.0, False, True), and (16384, 7168, 1200.0, True, True).]
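For reference, a rough sketch of the semantics the test exercises — silu(x0) * x1 followed by FP8 quantization with a returned scale. This is reconstructed from the test body under the assumption of per-row scaling clamped by scale_ub; FBGEMM's Triton kernel (_fbgemm_silu_mul_quant) is the authoritative implementation and may scale differently:

from typing import Optional, Tuple

import torch


def silu_mul_quant_ref(
    x0: torch.Tensor,                         # [T, D], bfloat16
    x1: torch.Tensor,                         # [T, D], bfloat16
    scale_ub: Optional[torch.Tensor] = None,  # optional cap on the row max
) -> Tuple[torch.Tensor, torch.Tensor]:
    FP8_MAX = 448.0  # finite max of torch.float8_e4m3fn
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.float())
    y_scale = row_max / FP8_MAX  # per-row dequantization scale
    y_fp8 = (y / y_scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_fp8, y_scale.squeeze(1)

On hardware without fp8e4nv support, even this eager-mode reference would need a different FP8 flavor (e.g. torch.float8_e5m2, Triton's 'fp8e5').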
[log condensed: two more examples, (16384, 5120, 1200.0, True, False) and (1, 7168, 1200.0, False, False), failed the same way, each ending with:]
2025-05-07T20:31:45.6048079Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.6048183Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.6048261Z E       ^
2025-05-07T20:31:45.6048613Z E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.6048617Z 2025-05-07T20:31:45.6049030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.6049034Z 2025-05-07T20:31:45.6049143Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6049365Z self=, 2025-05-07T20:31:45.6049445Z T=4096, 2025-05-07T20:31:45.6049526Z D=7168, 2025-05-07T20:31:45.6049612Z scale_ub=1200.0, 2025-05-07T20:31:45.6049696Z contiguous=False, 2025-05-07T20:31:45.6049779Z compiled=True, 2025-05-07T20:31:45.6049853Z ) 2025-05-07T20:31:45.6050070Z self = 2025-05-07T20:31:45.6050257Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:45.6050261Z 2025-05-07T20:31:45.6050343Z @given( 2025-05-07T20:31:45.6050468Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6050569Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6050683Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6050802Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6050917Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6050994Z ) 2025-05-07T20:31:45.6051240Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6051331Z def test_silu_mul_quant( 2025-05-07T20:31:45.6051404Z self, 2025-05-07T20:31:45.6051484Z T: int, 2025-05-07T20:31:45.6051562Z D: int, 2025-05-07T20:31:45.6051658Z scale_ub: Optional[float], 2025-05-07T20:31:45.6051753Z contiguous: bool, 2025-05-07T20:31:45.6051924Z compiled: bool, 2025-05-07T20:31:45.6057280Z ) -> None: 2025-05-07T20:31:45.6057397Z torch.manual_seed(2025) 2025-05-07T20:31:45.6057480Z 2025-05-07T20:31:45.6057660Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6057744Z 2025-05-07T20:31:45.6057839Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6057968Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6058073Z x = x_sign * x_clamp 2025-05-07T20:31:45.6058155Z x0 = x[:, :D] 2025-05-07T20:31:45.6058237Z x1 = x[:, D:] 2025-05-07T20:31:45.6058316Z 2025-05-07T20:31:45.6058404Z if contiguous: 2025-05-07T20:31:45.6058497Z x0 = x0.contiguous() 2025-05-07T20:31:45.6058594Z x1 = x1.contiguous() 2025-05-07T20:31:45.6058670Z 2025-05-07T20:31:45.6058763Z if scale_ub is not None: 2025-05-07T20:31:45.6058878Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.6059024Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.6059108Z ) 2025-05-07T20:31:45.6059186Z else: 2025-05-07T20:31:45.6059283Z scale_ub_tensor = None 2025-05-07T20:31:45.6059363Z 2025-05-07T20:31:45.6059497Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.6059589Z op = silu_mul_quant 2025-05-07T20:31:45.6059679Z if compiled: 2025-05-07T20:31:45.6059888Z op = torch.compile(op) 2025-05-07T20:31:45.6060000Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6060082Z 2025-05-07T20:31:45.6060177Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.6060183Z 2025-05-07T20:31:45.6060286Z moe/activation_test.py:117: 2025-05-07T20:31:45.6060426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6060530Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.6060638Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6061028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.6061126Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.6061635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.6061738Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.6062109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.6062339Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.6062682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.6062782Z kernel = self.compile( 2025-05-07T20:31:45.6063169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.6063356Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.6063491Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6063496Z 2025-05-07T20:31:45.6063704Z self = 2025-05-07T20:31:45.6064498Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.6065005Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c852ccc0>} 2025-05-07T20:31:45.6065766Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.6066042Z context = 2025-05-07T20:31:45.6066047Z 2025-05-07T20:31:45.6066216Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.6066487Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.6066600Z module_map=module_map) 2025-05-07T20:31:45.6066771Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.6066880Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.6066961Z E ^ 2025-05-07T20:31:45.6067325Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.6067330Z 2025-05-07T20:31:45.6067747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.6067756Z 2025-05-07T20:31:45.6067863Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6068092Z self=, 2025-05-07T20:31:45.6068173Z T=128, 2025-05-07T20:31:45.6068254Z D=7168, 2025-05-07T20:31:45.6068342Z scale_ub=1200.0, 2025-05-07T20:31:45.6068430Z contiguous=False, 2025-05-07T20:31:45.6068523Z compiled=True, 2025-05-07T20:31:45.6068598Z ) 2025-05-07T20:31:45.6068955Z self = 2025-05-07T20:31:45.6069134Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:45.6069139Z 2025-05-07T20:31:45.6069220Z @given( 2025-05-07T20:31:45.6069344Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6069452Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6069569Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6069689Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6069812Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6069890Z ) 2025-05-07T20:31:45.6070145Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6070242Z def test_silu_mul_quant( 2025-05-07T20:31:45.6070322Z self, 2025-05-07T20:31:45.6070404Z T: int, 2025-05-07T20:31:45.6070482Z D: int, 2025-05-07T20:31:45.6070583Z scale_ub: Optional[float], 2025-05-07T20:31:45.6070683Z contiguous: bool, 2025-05-07T20:31:45.6070768Z compiled: bool, 2025-05-07T20:31:45.6070846Z ) -> None: 2025-05-07T20:31:45.6070942Z torch.manual_seed(2025) 2025-05-07T20:31:45.6071015Z 2025-05-07T20:31:45.6071185Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6071266Z 2025-05-07T20:31:45.6071359Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6071487Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6071580Z x = x_sign * x_clamp 2025-05-07T20:31:45.6071660Z x0 = x[:, :D] 2025-05-07T20:31:45.6071744Z x1 = x[:, D:] 2025-05-07T20:31:45.6071816Z 2025-05-07T20:31:45.6071900Z if contiguous: 2025-05-07T20:31:45.6071996Z x0 = x0.contiguous() 2025-05-07T20:31:45.6072087Z x1 = x1.contiguous() 2025-05-07T20:31:45.6072162Z 2025-05-07T20:31:45.6072258Z if scale_ub is not None: 2025-05-07T20:31:45.6072365Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.6072501Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.6072583Z ) 2025-05-07T20:31:45.6072663Z else: 2025-05-07T20:31:45.6072758Z scale_ub_tensor = None 2025-05-07T20:31:45.6072833Z 2025-05-07T20:31:45.6072963Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.6073061Z op = silu_mul_quant 2025-05-07T20:31:45.6073147Z if compiled: 2025-05-07T20:31:45.6073332Z op = torch.compile(op) 2025-05-07T20:31:45.6073440Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6073513Z 2025-05-07T20:31:45.6073604Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.6073609Z 2025-05-07T20:31:45.6073716Z moe/activation_test.py:117: 2025-05-07T20:31:45.6073847Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6073950Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.6074054Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6074422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.6074517Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.6075012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.6075109Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.6075474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.6075696Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.6076036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.6076135Z kernel = self.compile( 2025-05-07T20:31:45.6076599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.6076779Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.6076908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6076912Z 2025-05-07T20:31:45.6077116Z self = 2025-05-07T20:31:45.6077897Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.6078398Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c852d580>} 2025-05-07T20:31:45.6079159Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.6079348Z context = 2025-05-07T20:31:45.6079353Z 2025-05-07T20:31:45.6079521Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.6079785Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.6079894Z module_map=module_map) 2025-05-07T20:31:45.6080068Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.6080169Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.6080248Z E ^ 2025-05-07T20:31:45.6080604Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.6080609Z 2025-05-07T20:31:45.6081022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.6081031Z 2025-05-07T20:31:45.6081140Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6081360Z self=, 2025-05-07T20:31:45.6081439Z T=2048, 2025-05-07T20:31:45.6081520Z D=7168, 2025-05-07T20:31:45.6081604Z scale_ub=None, 2025-05-07T20:31:45.6081689Z contiguous=True, 2025-05-07T20:31:45.6081778Z compiled=True, 2025-05-07T20:31:45.6081852Z ) 2025-05-07T20:31:45.6082154Z self = 2025-05-07T20:31:45.6082324Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.6082329Z 2025-05-07T20:31:45.6082408Z @given( 2025-05-07T20:31:45.6082529Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6082629Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6082741Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6082869Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6082981Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6083055Z ) 2025-05-07T20:31:45.6083302Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6083396Z def test_silu_mul_quant( 2025-05-07T20:31:45.6083479Z self, 2025-05-07T20:31:45.6083557Z T: int, 2025-05-07T20:31:45.6083633Z D: int, 2025-05-07T20:31:45.6083740Z scale_ub: Optional[float], 2025-05-07T20:31:45.6083834Z contiguous: bool, 2025-05-07T20:31:45.6083920Z compiled: bool, 2025-05-07T20:31:45.6084000Z ) -> None: 2025-05-07T20:31:45.6084095Z torch.manual_seed(2025) 2025-05-07T20:31:45.6084168Z 2025-05-07T20:31:45.6084341Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6084420Z 2025-05-07T20:31:45.6084513Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6084723Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6084816Z x = x_sign * x_clamp 2025-05-07T20:31:45.6084899Z x0 = x[:, :D] 2025-05-07T20:31:45.6084980Z x1 = x[:, D:] 2025-05-07T20:31:45.6085054Z 2025-05-07T20:31:45.6085141Z if contiguous: 2025-05-07T20:31:45.6085233Z x0 = x0.contiguous() 2025-05-07T20:31:45.6085323Z x1 = x1.contiguous() 2025-05-07T20:31:45.6085402Z 2025-05-07T20:31:45.6085495Z if scale_ub is not None: 2025-05-07T20:31:45.6085616Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.6085749Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.6085826Z ) 2025-05-07T20:31:45.6085907Z else: 2025-05-07T20:31:45.6086002Z scale_ub_tensor = None 2025-05-07T20:31:45.6086076Z 2025-05-07T20:31:45.6086212Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.6086303Z op = silu_mul_quant 2025-05-07T20:31:45.6086393Z if compiled: 2025-05-07T20:31:45.6086497Z op = torch.compile(op) 2025-05-07T20:31:45.6086601Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6086677Z 2025-05-07T20:31:45.6086772Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.6086777Z 2025-05-07T20:31:45.6086876Z moe/activation_test.py:117: 2025-05-07T20:31:45.6087009Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6087110Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.6087215Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6087657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.6087753Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.6088245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.6088348Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.6088706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.6088929Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.6089268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.6089362Z kernel = self.compile( 2025-05-07T20:31:45.6089744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.6090026Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.6090158Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6090162Z 2025-05-07T20:31:45.6090367Z self = 2025-05-07T20:31:45.6091144Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.6091643Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c852e480>} 2025-05-07T20:31:45.6092387Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.6092583Z context = 2025-05-07T20:31:45.6092588Z 2025-05-07T20:31:45.6092750Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.6093011Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.6093195Z module_map=module_map) 2025-05-07T20:31:45.6093360Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.6093462Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.6093540Z E ^ 2025-05-07T20:31:45.6093894Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.6093898Z 2025-05-07T20:31:45.6094312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.6094322Z 2025-05-07T20:31:45.6094425Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6094650Z self=, 2025-05-07T20:31:45.6094732Z T=16384, 2025-05-07T20:31:45.6094811Z D=5120, 2025-05-07T20:31:45.6094900Z scale_ub=None, 2025-05-07T20:31:45.6094987Z contiguous=False, 2025-05-07T20:31:45.6095072Z compiled=False, 2025-05-07T20:31:45.6095153Z ) 2025-05-07T20:31:45.6095374Z self = 2025-05-07T20:31:45.6095551Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.6095556Z 2025-05-07T20:31:45.6095639Z @given( 2025-05-07T20:31:45.6095759Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6095857Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6095976Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6096097Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6096211Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6096288Z ) 2025-05-07T20:31:45.6096532Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6096629Z def test_silu_mul_quant( 2025-05-07T20:31:45.6096707Z self, 2025-05-07T20:31:45.6096785Z T: int, 2025-05-07T20:31:45.6096864Z D: int, 2025-05-07T20:31:45.6096968Z scale_ub: Optional[float], 2025-05-07T20:31:45.6097057Z contiguous: bool, 2025-05-07T20:31:45.6097147Z compiled: bool, 2025-05-07T20:31:45.6097228Z ) -> None: 2025-05-07T20:31:45.6097325Z torch.manual_seed(2025) 2025-05-07T20:31:45.6097400Z 2025-05-07T20:31:45.6097567Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6097653Z 2025-05-07T20:31:45.6097748Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6097964Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6099771Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6099777Z 2025-05-07T20:31:45.6099896Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:45.6099901Z 2025-05-07T20:31:45.6100007Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6100226Z self=, 2025-05-07T20:31:45.6100313Z T=4096, 2025-05-07T20:31:45.6100397Z D=7168, 2025-05-07T20:31:45.6100483Z scale_ub=1200.0, 2025-05-07T20:31:45.6100576Z contiguous=True, 2025-05-07T20:31:45.6100661Z compiled=True, 2025-05-07T20:31:45.6100737Z ) 2025-05-07T20:31:45.6100958Z self = 2025-05-07T20:31:45.6101129Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:45.6101134Z 2025-05-07T20:31:45.6101213Z @given( 2025-05-07T20:31:45.6101412Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6101510Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6101625Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6101746Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6101858Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6101942Z ) 2025-05-07T20:31:45.6102184Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6102287Z def test_silu_mul_quant( 2025-05-07T20:31:45.6102369Z self, 2025-05-07T20:31:45.6102448Z T: int, 2025-05-07T20:31:45.6102525Z D: int, 2025-05-07T20:31:45.6102628Z scale_ub: Optional[float], 2025-05-07T20:31:45.6102717Z contiguous: bool, 2025-05-07T20:31:45.6102803Z compiled: bool, 2025-05-07T20:31:45.6102887Z ) -> None: 2025-05-07T20:31:45.6102981Z torch.manual_seed(2025) 2025-05-07T20:31:45.6103057Z 2025-05-07T20:31:45.6103234Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6103312Z 2025-05-07T20:31:45.6103403Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6103532Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6105317Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6105332Z 2025-05-07T20:31:45.6105448Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:45.6105453Z 2025-05-07T20:31:45.6105558Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6106045Z self=, 2025-05-07T20:31:45.6106142Z T=16384, 2025-05-07T20:31:45.6106221Z D=7168, 2025-05-07T20:31:45.6106306Z scale_ub=None, 2025-05-07T20:31:45.6106395Z contiguous=False, 2025-05-07T20:31:45.6106480Z compiled=False, 2025-05-07T20:31:45.6106559Z ) 2025-05-07T20:31:45.6106775Z self = 2025-05-07T20:31:45.6107102Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.6107108Z 2025-05-07T20:31:45.6107187Z @given( 2025-05-07T20:31:45.6107305Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6107406Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6107518Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6107634Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6107753Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6107829Z ) 2025-05-07T20:31:45.6108074Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6108168Z def test_silu_mul_quant( 2025-05-07T20:31:45.6108248Z self, 2025-05-07T20:31:45.6108330Z T: int, 2025-05-07T20:31:45.6108407Z D: int, 2025-05-07T20:31:45.6108504Z scale_ub: Optional[float], 2025-05-07T20:31:45.6108601Z contiguous: bool, 2025-05-07T20:31:45.6108687Z compiled: bool, 2025-05-07T20:31:45.6108766Z ) -> None: 2025-05-07T20:31:45.6108864Z torch.manual_seed(2025) 2025-05-07T20:31:45.6108937Z 2025-05-07T20:31:45.6109105Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6111003Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6111010Z 2025-05-07T20:31:45.6111131Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6111141Z 2025-05-07T20:31:45.6111248Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6111469Z self=, 2025-05-07T20:31:45.6111554Z T=2048, 2025-05-07T20:31:45.6111629Z D=7168, 2025-05-07T20:31:45.6111712Z scale_ub=1200.0, 2025-05-07T20:31:45.6111801Z contiguous=True, 2025-05-07T20:31:45.6111886Z compiled=True, 2025-05-07T20:31:45.6111962Z ) 2025-05-07T20:31:45.6112186Z self = 2025-05-07T20:31:45.6112357Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:45.6112361Z 2025-05-07T20:31:45.6112441Z @given( 2025-05-07T20:31:45.6112562Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6112658Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6112778Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6112901Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6113016Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6113099Z ) 2025-05-07T20:31:45.6113341Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6113438Z def test_silu_mul_quant( 2025-05-07T20:31:45.6113518Z self, 2025-05-07T20:31:45.6113596Z T: int, 2025-05-07T20:31:45.6113674Z D: int, 2025-05-07T20:31:45.6113779Z scale_ub: Optional[float], 2025-05-07T20:31:45.6113869Z contiguous: bool, 2025-05-07T20:31:45.6113955Z compiled: bool, 2025-05-07T20:31:45.6114043Z ) -> None: 2025-05-07T20:31:45.6114137Z torch.manual_seed(2025) 2025-05-07T20:31:45.6114210Z 2025-05-07T20:31:45.6114381Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6114459Z 2025-05-07T20:31:45.6114556Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6114679Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6116537Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6116543Z 2025-05-07T20:31:45.6116662Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:45.6116666Z 2025-05-07T20:31:45.6116774Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6116994Z self=, 2025-05-07T20:31:45.6117071Z T=2048, 2025-05-07T20:31:45.6117158Z D=7168, 2025-05-07T20:31:45.6117242Z scale_ub=None, 2025-05-07T20:31:45.6117329Z contiguous=True, 2025-05-07T20:31:45.6117417Z compiled=False, 2025-05-07T20:31:45.6117495Z ) 2025-05-07T20:31:45.6117708Z self = 2025-05-07T20:31:45.6117881Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.6117885Z 2025-05-07T20:31:45.6117965Z @given( 2025-05-07T20:31:45.6118187Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6118287Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6118402Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6118523Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6118636Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6118714Z ) 2025-05-07T20:31:45.6118960Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6119061Z def test_silu_mul_quant( 2025-05-07T20:31:45.6119141Z self, 2025-05-07T20:31:45.6119221Z T: int, 2025-05-07T20:31:45.6119300Z D: int, 2025-05-07T20:31:45.6119401Z scale_ub: Optional[float], 2025-05-07T20:31:45.6119493Z contiguous: bool, 2025-05-07T20:31:45.6119579Z compiled: bool, 2025-05-07T20:31:45.6119660Z ) -> None: 2025-05-07T20:31:45.6119756Z torch.manual_seed(2025) 2025-05-07T20:31:45.6119828Z 2025-05-07T20:31:45.6120003Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6120081Z 2025-05-07T20:31:45.6120172Z > x_sign = torch.sign(x) 2025-05-07T20:31:45.6121939Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6121950Z 2025-05-07T20:31:45.6122066Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:45.6122071Z 2025-05-07T20:31:45.6122176Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6122397Z self=, 2025-05-07T20:31:45.6122480Z T=1, 2025-05-07T20:31:45.6122560Z D=7168, 2025-05-07T20:31:45.6122644Z scale_ub=1200.0, 2025-05-07T20:31:45.6122731Z contiguous=True, 2025-05-07T20:31:45.6122815Z compiled=False, 2025-05-07T20:31:45.6122890Z ) 2025-05-07T20:31:45.6123107Z self = 2025-05-07T20:31:45.6123271Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.6123363Z 2025-05-07T20:31:45.6123445Z @given( 2025-05-07T20:31:45.6123564Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6123660Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6123772Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6123890Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6124002Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6124077Z ) 2025-05-07T20:31:45.6124324Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6124420Z def test_silu_mul_quant( 2025-05-07T20:31:45.6124502Z self, 2025-05-07T20:31:45.6124580Z T: int, 2025-05-07T20:31:45.6124657Z D: int, 2025-05-07T20:31:45.6124759Z scale_ub: Optional[float], 2025-05-07T20:31:45.6124850Z contiguous: bool, 2025-05-07T20:31:45.6124934Z compiled: bool, 2025-05-07T20:31:45.6125022Z ) -> None: 2025-05-07T20:31:45.6125117Z torch.manual_seed(2025) 2025-05-07T20:31:45.6125194Z 2025-05-07T20:31:45.6125363Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6125438Z 2025-05-07T20:31:45.6125532Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6125656Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6125748Z x = x_sign * x_clamp 2025-05-07T20:31:45.6125831Z x0 = x[:, :D] 2025-05-07T20:31:45.6125989Z x1 = x[:, D:] 2025-05-07T20:31:45.6126066Z 2025-05-07T20:31:45.6126152Z if contiguous: 2025-05-07T20:31:45.6126245Z x0 = x0.contiguous() 2025-05-07T20:31:45.6126336Z x1 = x1.contiguous() 2025-05-07T20:31:45.6126415Z 2025-05-07T20:31:45.6126506Z if scale_ub is not None: 2025-05-07T20:31:45.6126612Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.6126752Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.6126837Z ) 2025-05-07T20:31:45.6126917Z else: 2025-05-07T20:31:45.6127012Z scale_ub_tensor = None 2025-05-07T20:31:45.6127086Z 2025-05-07T20:31:45.6127222Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.6127313Z op = silu_mul_quant 2025-05-07T20:31:45.6127400Z if compiled: 2025-05-07T20:31:45.6127602Z op = torch.compile(op) 2025-05-07T20:31:45.6127709Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6127790Z 2025-05-07T20:31:45.6127885Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.6127889Z 2025-05-07T20:31:45.6127986Z moe/activation_test.py:117: 2025-05-07T20:31:45.6128115Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6128221Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.6128319Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6128823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.6128927Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.6129285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.6129511Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.6129853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.6129951Z kernel = self.compile( 2025-05-07T20:31:45.6130330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.6130505Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.6130634Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6130639Z 2025-05-07T20:31:45.6130843Z self = 2025-05-07T20:31:45.6131713Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.6132217Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb759d00>} 2025-05-07T20:31:45.6132961Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.6133158Z context = 2025-05-07T20:31:45.6133162Z 2025-05-07T20:31:45.6133324Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.6133594Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.6133708Z module_map=module_map) 2025-05-07T20:31:45.6133870Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.6133971Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.6134051Z E ^ 2025-05-07T20:31:45.6134403Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.6134490Z 2025-05-07T20:31:45.6134907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.6134912Z 2025-05-07T20:31:45.6135015Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6135239Z self=, 2025-05-07T20:31:45.6135321Z T=128, 2025-05-07T20:31:45.6135402Z D=5120, 2025-05-07T20:31:45.6135498Z scale_ub=None, 2025-05-07T20:31:45.6135585Z contiguous=True, 2025-05-07T20:31:45.6135672Z compiled=False, 2025-05-07T20:31:45.6135749Z ) 2025-05-07T20:31:45.6135964Z self = 2025-05-07T20:31:45.6136139Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.6136144Z 2025-05-07T20:31:45.6136222Z @given( 2025-05-07T20:31:45.6136339Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6136445Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6136560Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6136677Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6136790Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6136868Z ) 2025-05-07T20:31:45.6137112Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6137209Z def test_silu_mul_quant( 2025-05-07T20:31:45.6137291Z self, 2025-05-07T20:31:45.6137371Z T: int, 2025-05-07T20:31:45.6137449Z D: int, 2025-05-07T20:31:45.6137549Z scale_ub: Optional[float], 2025-05-07T20:31:45.6137642Z contiguous: bool, 2025-05-07T20:31:45.6137729Z compiled: bool, 2025-05-07T20:31:45.6137809Z ) -> None: 2025-05-07T20:31:45.6137910Z torch.manual_seed(2025) 2025-05-07T20:31:45.6137985Z 2025-05-07T20:31:45.6138156Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6138234Z 2025-05-07T20:31:45.6138327Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6138451Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6138543Z x = x_sign * x_clamp 2025-05-07T20:31:45.6138625Z x0 = x[:, :D] 2025-05-07T20:31:45.6138714Z x1 = x[:, D:] 2025-05-07T20:31:45.6138791Z 2025-05-07T20:31:45.6138877Z if contiguous: 2025-05-07T20:31:45.6138972Z x0 = x0.contiguous() 2025-05-07T20:31:45.6139151Z x1 = x1.contiguous() 2025-05-07T20:31:45.6139225Z 2025-05-07T20:31:45.6139318Z if scale_ub is not None: 2025-05-07T20:31:45.6139422Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.6139556Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.6139637Z ) 2025-05-07T20:31:45.6139715Z else: 2025-05-07T20:31:45.6139808Z scale_ub_tensor = None 2025-05-07T20:31:45.6139888Z 2025-05-07T20:31:45.6140024Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.6140118Z op = silu_mul_quant 2025-05-07T20:31:45.6140205Z if compiled: 2025-05-07T20:31:45.6140306Z op = torch.compile(op) 2025-05-07T20:31:45.6140416Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6140492Z 2025-05-07T20:31:45.6140583Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.6140588Z 2025-05-07T20:31:45.6140696Z moe/activation_test.py:117: 2025-05-07T20:31:45.6140826Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6140929Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.6141034Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6141531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.6141631Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.6142066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.6142289Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.6142632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.6142727Z kernel = self.compile( 2025-05-07T20:31:45.6143105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.6143285Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.6143413Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6143417Z 2025-05-07T20:31:45.6143625Z self = 2025-05-07T20:31:45.6144400Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.6144900Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb75ae80>} 2025-05-07T20:31:45.6145650Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.6145846Z context = 2025-05-07T20:31:45.6145850Z 2025-05-07T20:31:45.6146017Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.6146279Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.6146392Z module_map=module_map) 2025-05-07T20:31:45.6146560Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.6146664Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.6146747Z E ^ 2025-05-07T20:31:45.6147098Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.6147103Z 2025-05-07T20:31:45.6147513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.6147599Z 2025-05-07T20:31:45.6147708Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6147928Z self=, 2025-05-07T20:31:45.6148012Z T=128, 2025-05-07T20:31:45.6148089Z D=7168, 2025-05-07T20:31:45.6148174Z scale_ub=None, 2025-05-07T20:31:45.6148262Z contiguous=True, 2025-05-07T20:31:45.6148349Z compiled=False, 2025-05-07T20:31:45.6148426Z ) 2025-05-07T20:31:45.6148655Z self = 2025-05-07T20:31:45.6148824Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.6148828Z 2025-05-07T20:31:45.6148907Z @given( 2025-05-07T20:31:45.6149030Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6149131Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6149247Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6149368Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6149481Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6149564Z ) 2025-05-07T20:31:45.6149806Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6149901Z def test_silu_mul_quant( 2025-05-07T20:31:45.6149985Z self, 2025-05-07T20:31:45.6150063Z T: int, 2025-05-07T20:31:45.6150141Z D: int, 2025-05-07T20:31:45.6150344Z scale_ub: Optional[float], 2025-05-07T20:31:45.6150436Z contiguous: bool, 2025-05-07T20:31:45.6150524Z compiled: bool, 2025-05-07T20:31:45.6150606Z ) -> None: 2025-05-07T20:31:45.6150704Z torch.manual_seed(2025) 2025-05-07T20:31:45.6150782Z 2025-05-07T20:31:45.6150951Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6151031Z 2025-05-07T20:31:45.6151125Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6151254Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6151344Z x = x_sign * x_clamp 2025-05-07T20:31:45.6151430Z x0 = x[:, :D] 2025-05-07T20:31:45.6151512Z x1 = x[:, D:] 2025-05-07T20:31:45.6151588Z 2025-05-07T20:31:45.6151675Z if contiguous: 2025-05-07T20:31:45.6151767Z x0 = x0.contiguous() 2025-05-07T20:31:45.6151858Z x1 = x1.contiguous() 2025-05-07T20:31:45.6151940Z 2025-05-07T20:31:45.6152032Z if scale_ub is not None: 2025-05-07T20:31:45.6152147Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.6152281Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.6152359Z ) 2025-05-07T20:31:45.6152442Z else: 2025-05-07T20:31:45.6152536Z scale_ub_tensor = None 2025-05-07T20:31:45.6152610Z 2025-05-07T20:31:45.6152742Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.6152834Z op = silu_mul_quant 2025-05-07T20:31:45.6152923Z if compiled: 2025-05-07T20:31:45.6153024Z op = torch.compile(op) 2025-05-07T20:31:45.6153128Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6153204Z 2025-05-07T20:31:45.6153302Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.6153307Z 2025-05-07T20:31:45.6153404Z moe/activation_test.py:117: 2025-05-07T20:31:45.6153538Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6153638Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.6153744Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6154244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.6154342Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.6154700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.6154926Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.6155354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.6155454Z kernel = self.compile( 2025-05-07T20:31:45.6155834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.6156011Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.6156148Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6156153Z 2025-05-07T20:31:45.6156358Z self = 2025-05-07T20:31:45.6157134Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.6157636Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb75bec0>} 2025-05-07T20:31:45.6158380Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.6158649Z context = 2025-05-07T20:31:45.6158654Z 2025-05-07T20:31:45.6158819Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.6159082Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.6159190Z module_map=module_map) 2025-05-07T20:31:45.6159350Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.6159453Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.6159537Z E ^ 2025-05-07T20:31:45.6159891Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.6159901Z 2025-05-07T20:31:45.6160313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.6160318Z 2025-05-07T20:31:45.6160423Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6160653Z self=, 2025-05-07T20:31:45.6160732Z T=2048, 2025-05-07T20:31:45.6160810Z D=7168, 2025-05-07T20:31:45.6160900Z scale_ub=1200.0, 2025-05-07T20:31:45.6160987Z contiguous=True, 2025-05-07T20:31:45.6161072Z compiled=False, 2025-05-07T20:31:45.6161151Z ) 2025-05-07T20:31:45.6161369Z self = 2025-05-07T20:31:45.6161547Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.6161556Z 2025-05-07T20:31:45.6161635Z @given( 2025-05-07T20:31:45.6161754Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6161858Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6161973Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6162088Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6162204Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6162280Z ) 2025-05-07T20:31:45.6162534Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6162629Z def test_silu_mul_quant( 2025-05-07T20:31:45.6162708Z self, 2025-05-07T20:31:45.6162789Z T: int, 2025-05-07T20:31:45.6162867Z D: int, 2025-05-07T20:31:45.6162964Z scale_ub: Optional[float], 2025-05-07T20:31:45.6163056Z contiguous: bool, 2025-05-07T20:31:45.6163142Z compiled: bool, 2025-05-07T20:31:45.6163301Z ) -> None: 2025-05-07T20:31:45.6163404Z torch.manual_seed(2025) 2025-05-07T20:31:45.6163478Z 2025-05-07T20:31:45.6163648Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6165429Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6165436Z 2025-05-07T20:31:45.6165553Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6165560Z 2025-05-07T20:31:45.6165663Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6165890Z self=, 2025-05-07T20:31:45.6165974Z T=1, 2025-05-07T20:31:45.6166052Z D=5120, 2025-05-07T20:31:45.6166134Z scale_ub=1200.0, 2025-05-07T20:31:45.6166225Z contiguous=True, 2025-05-07T20:31:45.6166313Z compiled=False, 2025-05-07T20:31:45.6166391Z ) 2025-05-07T20:31:45.6166610Z self = 2025-05-07T20:31:45.6166851Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.6166857Z 2025-05-07T20:31:45.6166934Z @given( 2025-05-07T20:31:45.6167056Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6167152Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6167268Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6167384Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6167548Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6167634Z ) 2025-05-07T20:31:45.6167878Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6167973Z def test_silu_mul_quant( 2025-05-07T20:31:45.6168054Z self, 2025-05-07T20:31:45.6168133Z T: int, 2025-05-07T20:31:45.6168209Z D: int, 2025-05-07T20:31:45.6168308Z scale_ub: Optional[float], 2025-05-07T20:31:45.6168400Z contiguous: bool, 2025-05-07T20:31:45.6168488Z compiled: bool, 2025-05-07T20:31:45.6168574Z ) -> None: 2025-05-07T20:31:45.6168669Z torch.manual_seed(2025) 2025-05-07T20:31:45.6168747Z 2025-05-07T20:31:45.6168915Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6168993Z 2025-05-07T20:31:45.6169087Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6169212Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6169303Z x = x_sign * x_clamp 2025-05-07T20:31:45.6169391Z x0 = x[:, :D] 2025-05-07T20:31:45.6169472Z x1 = x[:, D:] 2025-05-07T20:31:45.6169548Z 2025-05-07T20:31:45.6169632Z if contiguous: 2025-05-07T20:31:45.6169725Z x0 = x0.contiguous() 2025-05-07T20:31:45.6169815Z x1 = x1.contiguous() 2025-05-07T20:31:45.6169895Z 2025-05-07T20:31:45.6169987Z if scale_ub is not None: 2025-05-07T20:31:45.6170099Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.6170239Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.6170315Z ) 2025-05-07T20:31:45.6170397Z else: 2025-05-07T20:31:45.6170491Z scale_ub_tensor = None 2025-05-07T20:31:45.6170563Z 2025-05-07T20:31:45.6170697Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.6170789Z op = silu_mul_quant 2025-05-07T20:31:45.6170874Z if compiled: 2025-05-07T20:31:45.6170976Z op = torch.compile(op) 2025-05-07T20:31:45.6171173Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6171249Z 2025-05-07T20:31:45.6171344Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.6171349Z 2025-05-07T20:31:45.6171448Z moe/activation_test.py:117: 2025-05-07T20:31:45.6171583Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6171682Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.6171782Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6172287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.6172386Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.6172741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.6172964Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.6173303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.6173407Z kernel = self.compile( 2025-05-07T20:31:45.6173786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.6173962Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.6174094Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6174099Z 2025-05-07T20:31:45.6174379Z self = 2025-05-07T20:31:45.6175155Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.6175652Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb85d4e0>} 2025-05-07T20:31:45.6176403Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.6176598Z context = 2025-05-07T20:31:45.6176603Z 2025-05-07T20:31:45.6176773Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.6177036Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.6177144Z module_map=module_map) 2025-05-07T20:31:45.6177306Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.6177411Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.6177490Z E ^ 2025-05-07T20:31:45.6177850Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.6177859Z 2025-05-07T20:31:45.6178273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.6178278Z 2025-05-07T20:31:45.6178382Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6178606Z self=, 2025-05-07T20:31:45.6178686Z T=2048, 2025-05-07T20:31:45.6178761Z D=5120, 2025-05-07T20:31:45.6178857Z scale_ub=None, 2025-05-07T20:31:45.6178941Z contiguous=True, 2025-05-07T20:31:45.6179030Z compiled=False, 2025-05-07T20:31:45.6184198Z ) 2025-05-07T20:31:45.6184455Z self = 2025-05-07T20:31:45.6184636Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.6184641Z 2025-05-07T20:31:45.6184726Z @given( 2025-05-07T20:31:45.6184987Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6185090Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6185211Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6185331Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6185444Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6185525Z ) 2025-05-07T20:31:45.6185777Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6185883Z def test_silu_mul_quant( 2025-05-07T20:31:45.6185964Z self, 2025-05-07T20:31:45.6186044Z T: int, 2025-05-07T20:31:45.6186128Z D: int, 2025-05-07T20:31:45.6186231Z scale_ub: Optional[float], 2025-05-07T20:31:45.6186323Z contiguous: bool, 2025-05-07T20:31:45.6186420Z compiled: bool, 2025-05-07T20:31:45.6186502Z ) -> None: 2025-05-07T20:31:45.6186599Z torch.manual_seed(2025) 2025-05-07T20:31:45.6186680Z 2025-05-07T20:31:45.6186859Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6186936Z 2025-05-07T20:31:45.6187034Z > x_sign = torch.sign(x) 2025-05-07T20:31:45.6188917Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6188925Z 2025-05-07T20:31:45.6189051Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:45.6189056Z 2025-05-07T20:31:45.6189165Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6189397Z self=, 2025-05-07T20:31:45.6189486Z T=16384, 2025-05-07T20:31:45.6189565Z D=5120, 2025-05-07T20:31:45.6189659Z scale_ub=None, 2025-05-07T20:31:45.6189746Z contiguous=True, 2025-05-07T20:31:45.6189835Z compiled=False, 2025-05-07T20:31:45.6189918Z ) 2025-05-07T20:31:45.6190138Z self = 2025-05-07T20:31:45.6190316Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.6190327Z 2025-05-07T20:31:45.6190413Z @given( 2025-05-07T20:31:45.6190535Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6190640Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6190756Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6190874Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6190993Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6191076Z ) 2025-05-07T20:31:45.6191323Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6191423Z def test_silu_mul_quant( 2025-05-07T20:31:45.6191502Z self, 2025-05-07T20:31:45.6191581Z T: int, 2025-05-07T20:31:45.6191663Z D: int, 2025-05-07T20:31:45.6191763Z scale_ub: Optional[float], 2025-05-07T20:31:45.6191855Z contiguous: bool, 2025-05-07T20:31:45.6191947Z compiled: bool, 2025-05-07T20:31:45.6192029Z ) -> None: 2025-05-07T20:31:45.6192135Z torch.manual_seed(2025) 2025-05-07T20:31:45.6192210Z 2025-05-07T20:31:45.6192380Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6194160Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6194264Z 2025-05-07T20:31:45.6194385Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6194390Z 2025-05-07T20:31:45.6194501Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6194728Z self=, 2025-05-07T20:31:45.6194808Z T=4096, 2025-05-07T20:31:45.6194891Z D=5120, 2025-05-07T20:31:45.6194977Z scale_ub=None, 2025-05-07T20:31:45.6195064Z contiguous=True, 2025-05-07T20:31:45.6195161Z compiled=False, 2025-05-07T20:31:45.6195236Z ) 2025-05-07T20:31:45.6195456Z self = 2025-05-07T20:31:45.6195628Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.6195637Z 2025-05-07T20:31:45.6195715Z @given( 2025-05-07T20:31:45.6195841Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6195944Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6196058Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6196181Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6196294Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6196447Z ) 2025-05-07T20:31:45.6196699Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6196796Z def test_silu_mul_quant( 2025-05-07T20:31:45.6196879Z self, 2025-05-07T20:31:45.6196958Z T: int, 2025-05-07T20:31:45.6197037Z D: int, 2025-05-07T20:31:45.6197141Z scale_ub: Optional[float], 2025-05-07T20:31:45.6197232Z contiguous: bool, 2025-05-07T20:31:45.6197328Z compiled: bool, 2025-05-07T20:31:45.6197415Z ) -> None: 2025-05-07T20:31:45.6197511Z torch.manual_seed(2025) 2025-05-07T20:31:45.6197586Z 2025-05-07T20:31:45.6197759Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6199534Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6199540Z 2025-05-07T20:31:45.6199665Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6199670Z 2025-05-07T20:31:45.6199782Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6200007Z self=, 2025-05-07T20:31:45.6200087Z T=2048, 2025-05-07T20:31:45.6200165Z D=5120, 2025-05-07T20:31:45.6200253Z scale_ub=None, 2025-05-07T20:31:45.6200341Z contiguous=False, 2025-05-07T20:31:45.6200430Z compiled=False, 2025-05-07T20:31:45.6200511Z ) 2025-05-07T20:31:45.6200726Z self = 2025-05-07T20:31:45.6200904Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.6200909Z 2025-05-07T20:31:45.6200991Z @given( 2025-05-07T20:31:45.6201113Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6201221Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6201338Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6201455Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6201661Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6201739Z ) 2025-05-07T20:31:45.6201986Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6202087Z def test_silu_mul_quant( 2025-05-07T20:31:45.6202166Z self, 2025-05-07T20:31:45.6202245Z T: int, 2025-05-07T20:31:45.6202326Z D: int, 2025-05-07T20:31:45.6202428Z scale_ub: Optional[float], 2025-05-07T20:31:45.6202521Z contiguous: bool, 2025-05-07T20:31:45.6202618Z compiled: bool, 2025-05-07T20:31:45.6202699Z ) -> None: 2025-05-07T20:31:45.6202800Z torch.manual_seed(2025) 2025-05-07T20:31:45.6202877Z 2025-05-07T20:31:45.6203047Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6204814Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6204827Z 2025-05-07T20:31:45.6204947Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6204952Z 2025-05-07T20:31:45.6205205Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6205430Z self=, 2025-05-07T20:31:45.6205510Z T=4096, 2025-05-07T20:31:45.6207867Z D=7168, 2025-05-07T20:31:45.6208005Z scale_ub=None, 2025-05-07T20:31:45.6208088Z contiguous=True, 2025-05-07T20:31:45.6208174Z compiled=True, 2025-05-07T20:31:45.6208251Z ) 2025-05-07T20:31:45.6208479Z self = 2025-05-07T20:31:45.6208663Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.6208669Z 2025-05-07T20:31:45.6208747Z @given( 2025-05-07T20:31:45.6208870Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6208967Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6209084Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6209202Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6209317Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6209389Z ) 2025-05-07T20:31:45.6209636Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6209728Z def test_silu_mul_quant( 2025-05-07T20:31:45.6209808Z self, 2025-05-07T20:31:45.6209883Z T: int, 2025-05-07T20:31:45.6209958Z D: int, 2025-05-07T20:31:45.6210057Z scale_ub: Optional[float], 2025-05-07T20:31:45.6210149Z contiguous: bool, 2025-05-07T20:31:45.6210231Z compiled: bool, 2025-05-07T20:31:45.6210313Z ) -> None: 2025-05-07T20:31:45.6210406Z torch.manual_seed(2025) 2025-05-07T20:31:45.6210478Z 2025-05-07T20:31:45.6210645Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6212478Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6212485Z 2025-05-07T20:31:45.6212603Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6212862Z 2025-05-07T20:31:45.6212968Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6213190Z self=, 2025-05-07T20:31:45.6213266Z T=2048, 2025-05-07T20:31:45.6213344Z D=5120, 2025-05-07T20:31:45.6213428Z scale_ub=1200.0, 2025-05-07T20:31:45.6213514Z contiguous=False, 2025-05-07T20:31:45.6213594Z compiled=False, 2025-05-07T20:31:45.6213672Z ) 2025-05-07T20:31:45.6213889Z self = 2025-05-07T20:31:45.6214060Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:45.6214064Z 2025-05-07T20:31:45.6214146Z @given( 2025-05-07T20:31:45.6214261Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6214360Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6214472Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6214592Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6214704Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6214775Z ) 2025-05-07T20:31:45.6215016Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6215111Z def test_silu_mul_quant( 2025-05-07T20:31:45.6215189Z self, 2025-05-07T20:31:45.6215266Z T: int, 2025-05-07T20:31:45.6215347Z D: int, 2025-05-07T20:31:45.6215441Z scale_ub: Optional[float], 2025-05-07T20:31:45.6215643Z contiguous: bool, 2025-05-07T20:31:45.6215730Z compiled: bool, 2025-05-07T20:31:45.6215807Z ) -> None: 2025-05-07T20:31:45.6215901Z torch.manual_seed(2025) 2025-05-07T20:31:45.6215974Z 2025-05-07T20:31:45.6216138Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6217902Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6217916Z 2025-05-07T20:31:45.6218036Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6218043Z 2025-05-07T20:31:45.6218141Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6218358Z self=, 2025-05-07T20:31:45.6218436Z T=4096, 2025-05-07T20:31:45.6218510Z D=7168, 2025-05-07T20:31:45.6218591Z scale_ub=1200.0, 2025-05-07T20:31:45.6218681Z contiguous=True, 2025-05-07T20:31:45.6218763Z compiled=False, 2025-05-07T20:31:45.6218841Z ) 2025-05-07T20:31:45.6219055Z self = 2025-05-07T20:31:45.6219222Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.6219226Z 2025-05-07T20:31:45.6219304Z @given( 2025-05-07T20:31:45.6219423Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6219520Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6219631Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6219750Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6219860Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6219938Z ) 2025-05-07T20:31:45.6220175Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6220266Z def test_silu_mul_quant( 2025-05-07T20:31:45.6220344Z self, 2025-05-07T20:31:45.6220415Z T: int, 2025-05-07T20:31:45.6220490Z D: int, 2025-05-07T20:31:45.6220671Z scale_ub: Optional[float], 2025-05-07T20:31:45.6220758Z contiguous: bool, 2025-05-07T20:31:45.6220842Z compiled: bool, 2025-05-07T20:31:45.6220920Z ) -> None: 2025-05-07T20:31:45.6221012Z torch.manual_seed(2025) 2025-05-07T20:31:45.6221091Z 2025-05-07T20:31:45.6221254Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6223027Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6223041Z 2025-05-07T20:31:45.6223156Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6223160Z 2025-05-07T20:31:45.6223259Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6223482Z self=, 2025-05-07T20:31:45.6223559Z T=16384, 2025-05-07T20:31:45.6223634Z D=7168, 2025-05-07T20:31:45.6223720Z scale_ub=None, 2025-05-07T20:31:45.6223805Z contiguous=False, 2025-05-07T20:31:45.6223886Z compiled=True, 2025-05-07T20:31:45.6224041Z ) 2025-05-07T20:31:45.6224255Z self = 2025-05-07T20:31:45.6224431Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.6224436Z 2025-05-07T20:31:45.6224514Z @given( 2025-05-07T20:31:45.6224627Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6224727Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6224843Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6224956Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6225066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6225140Z ) 2025-05-07T20:31:45.6225383Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6225475Z def test_silu_mul_quant( 2025-05-07T20:31:45.6225553Z self, 2025-05-07T20:31:45.6225631Z T: int, 2025-05-07T20:31:45.6225710Z D: int, 2025-05-07T20:31:45.6225808Z scale_ub: Optional[float], 2025-05-07T20:31:45.6225900Z contiguous: bool, 2025-05-07T20:31:45.6225985Z compiled: bool, 2025-05-07T20:31:45.6226063Z ) -> None: 2025-05-07T20:31:45.6226160Z torch.manual_seed(2025) 2025-05-07T20:31:45.6226232Z 2025-05-07T20:31:45.6226396Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6228165Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
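The requested sizes in these messages track the failing allocation exactly: the test builds a [T, 2*D] bfloat16 tensor, which occupies T * (2*D) * 2 bytes. A quick arithmetic cross-check in plain Python, reproducing the figures reported above:

    # [T, 2*D] bf16 tensor -> T * (2*D) * 2 bytes
    def randn_mib(T: int, D: int) -> float:
        return T * (2 * D) * 2 / 2**20

    assert randn_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"
    assert randn_mib(16384, 5120) == 320.0  # the 320.00 MiB example
    assert randn_mib(4096, 7168) == 112.0   # the 112.00 MiB examples
    assert randn_mib(2048, 5120) == 40.0    # the 40.00 MiB examples

(The flat 20.00 MiB requests in the T=128 examples further down do not match this formula; they appear to be the caching allocator's large-block granularity rather than the tensor size.)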
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6228176Z 2025-05-07T20:31:45.6228293Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6228301Z 2025-05-07T20:31:45.6228404Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6228624Z self=, 2025-05-07T20:31:45.6228705Z T=4096, 2025-05-07T20:31:45.6228782Z D=7168, 2025-05-07T20:31:45.6228865Z scale_ub=None, 2025-05-07T20:31:45.6229039Z contiguous=True, 2025-05-07T20:31:45.6229124Z compiled=False, 2025-05-07T20:31:45.6229198Z ) 2025-05-07T20:31:45.6229419Z self = 2025-05-07T20:31:45.6229587Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.6229592Z 2025-05-07T20:31:45.6229670Z @given( 2025-05-07T20:31:45.6229791Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6229893Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6230016Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6230131Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6230240Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6230318Z ) 2025-05-07T20:31:45.6230560Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6230653Z def test_silu_mul_quant( 2025-05-07T20:31:45.6230736Z self, 2025-05-07T20:31:45.6230820Z T: int, 2025-05-07T20:31:45.6230897Z D: int, 2025-05-07T20:31:45.6230999Z scale_ub: Optional[float], 2025-05-07T20:31:45.6231088Z contiguous: bool, 2025-05-07T20:31:45.6231175Z compiled: bool, 2025-05-07T20:31:45.6231254Z ) -> None: 2025-05-07T20:31:45.6231346Z torch.manual_seed(2025) 2025-05-07T20:31:45.6231424Z 2025-05-07T20:31:45.6231589Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6233433Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6233450Z 2025-05-07T20:31:45.6233568Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6233572Z 2025-05-07T20:31:45.6233674Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6233897Z self=, 2025-05-07T20:31:45.6233977Z T=16384, 2025-05-07T20:31:45.6234054Z D=7168, 2025-05-07T20:31:45.6234144Z scale_ub=None, 2025-05-07T20:31:45.6234233Z contiguous=True, 2025-05-07T20:31:45.6234316Z compiled=False, 2025-05-07T20:31:45.6234391Z ) 2025-05-07T20:31:45.6234604Z self = 2025-05-07T20:31:45.6234780Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.6234785Z 2025-05-07T20:31:45.6234861Z @given( 2025-05-07T20:31:45.6234978Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6235088Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6235200Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6235313Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6235427Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6235500Z ) 2025-05-07T20:31:45.6235744Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6235837Z def test_silu_mul_quant( 2025-05-07T20:31:45.6235920Z self, 2025-05-07T20:31:45.6236006Z T: int, 2025-05-07T20:31:45.6236083Z D: int, 2025-05-07T20:31:45.6236180Z scale_ub: Optional[float], 2025-05-07T20:31:45.6236270Z contiguous: bool, 2025-05-07T20:31:45.6236355Z compiled: bool, 2025-05-07T20:31:45.6236433Z ) -> None: 2025-05-07T20:31:45.6236534Z torch.manual_seed(2025) 2025-05-07T20:31:45.6236606Z 2025-05-07T20:31:45.6236770Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6238630Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6238636Z 2025-05-07T20:31:45.6238751Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6238759Z 2025-05-07T20:31:45.6238860Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6239077Z self=, 2025-05-07T20:31:45.6239156Z T=16384, 2025-05-07T20:31:45.6239237Z D=7168, 2025-05-07T20:31:45.6239324Z scale_ub=1200.0, 2025-05-07T20:31:45.6239410Z contiguous=True, 2025-05-07T20:31:45.6239493Z compiled=False, 2025-05-07T20:31:45.6239566Z ) 2025-05-07T20:31:45.6239780Z self = 2025-05-07T20:31:45.6239951Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.6239956Z 2025-05-07T20:31:45.6240033Z @given( 2025-05-07T20:31:45.6240258Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6240359Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6240474Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6240588Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6240700Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6240780Z ) 2025-05-07T20:31:45.6241023Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6241122Z def test_silu_mul_quant( 2025-05-07T20:31:45.6241201Z self, 2025-05-07T20:31:45.6241278Z T: int, 2025-05-07T20:31:45.6241354Z D: int, 2025-05-07T20:31:45.6241454Z scale_ub: Optional[float], 2025-05-07T20:31:45.6241541Z contiguous: bool, 2025-05-07T20:31:45.6241634Z compiled: bool, 2025-05-07T20:31:45.6241714Z ) -> None: 2025-05-07T20:31:45.6241807Z torch.manual_seed(2025) 2025-05-07T20:31:45.6241885Z 2025-05-07T20:31:45.6242055Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6243823Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6243840Z 2025-05-07T20:31:45.6243958Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6243963Z 2025-05-07T20:31:45.6244064Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6244288Z self=, 2025-05-07T20:31:45.6244367Z T=128, 2025-05-07T20:31:45.6244448Z D=5120, 2025-05-07T20:31:45.6244535Z scale_ub=1200.0, 2025-05-07T20:31:45.6244621Z contiguous=False, 2025-05-07T20:31:45.6244709Z compiled=False, 2025-05-07T20:31:45.6244782Z ) 2025-05-07T20:31:45.6244995Z self = 2025-05-07T20:31:45.6245172Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:45.6245177Z 2025-05-07T20:31:45.6245253Z @given( 2025-05-07T20:31:45.6245451Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6245553Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6245665Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6245788Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6245902Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6245976Z ) 2025-05-07T20:31:45.6246223Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6246323Z def test_silu_mul_quant( 2025-05-07T20:31:45.6246399Z self, 2025-05-07T20:31:45.6246480Z T: int, 2025-05-07T20:31:45.6246558Z D: int, 2025-05-07T20:31:45.6246658Z scale_ub: Optional[float], 2025-05-07T20:31:45.6246750Z contiguous: bool, 2025-05-07T20:31:45.6246835Z compiled: bool, 2025-05-07T20:31:45.6246918Z ) -> None: 2025-05-07T20:31:45.6247013Z torch.manual_seed(2025) 2025-05-07T20:31:45.6247093Z 2025-05-07T20:31:45.6247262Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6247335Z 2025-05-07T20:31:45.6247427Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6247674Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6247769Z x = x_sign * x_clamp 2025-05-07T20:31:45.6247853Z x0 = x[:, :D] 2025-05-07T20:31:45.6247940Z x1 = x[:, D:] 2025-05-07T20:31:45.6248016Z 2025-05-07T20:31:45.6248107Z if contiguous: 2025-05-07T20:31:45.6248291Z x0 = x0.contiguous() 2025-05-07T20:31:45.6248386Z x1 = x1.contiguous() 2025-05-07T20:31:45.6248469Z 2025-05-07T20:31:45.6248564Z if scale_ub is not None: 2025-05-07T20:31:45.6248676Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.6248819Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.6248899Z ) 2025-05-07T20:31:45.6248978Z else: 2025-05-07T20:31:45.6249086Z scale_ub_tensor = None 2025-05-07T20:31:45.6249161Z 2025-05-07T20:31:45.6249296Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.6249394Z op = silu_mul_quant 2025-05-07T20:31:45.6249480Z if compiled: 2025-05-07T20:31:45.6249583Z op = torch.compile(op) 2025-05-07T20:31:45.6249696Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6249770Z 2025-05-07T20:31:45.6249870Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.6249881Z 2025-05-07T20:31:45.6249982Z moe/activation_test.py:117: 2025-05-07T20:31:45.6250116Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6250226Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.6250327Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6250855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.6250976Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.6251354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:31:45.6251582Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.6251924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.6252022Z kernel = self.compile( 2025-05-07T20:31:45.6252421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.6252597Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.6252726Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6252733Z 2025-05-07T20:31:45.6252939Z self = 2025-05-07T20:31:45.6253719Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.6254314Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb368220>} 2025-05-07T20:31:45.6255068Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.6255262Z context = 2025-05-07T20:31:45.6255267Z 2025-05-07T20:31:45.6255430Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.6255696Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.6255815Z module_map=module_map) 2025-05-07T20:31:45.6255981Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.6256088Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.6256168Z E ^ 2025-05-07T20:31:45.6256527Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.6256532Z 2025-05-07T20:31:45.6257027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.6257032Z 2025-05-07T20:31:45.6257137Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6257360Z self=, 2025-05-07T20:31:45.6257451Z T=2048, 2025-05-07T20:31:45.6257530Z D=7168, 2025-05-07T20:31:45.6257618Z scale_ub=None, 2025-05-07T20:31:45.6257708Z contiguous=False, 2025-05-07T20:31:45.6257798Z compiled=False, 2025-05-07T20:31:45.6257876Z ) 2025-05-07T20:31:45.6258093Z self = 2025-05-07T20:31:45.6258270Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.6258274Z 2025-05-07T20:31:45.6258361Z @given( 2025-05-07T20:31:45.6258483Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6258587Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6258711Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6258828Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6258948Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6259029Z ) 2025-05-07T20:31:45.6259276Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6259377Z def test_silu_mul_quant( 2025-05-07T20:31:45.6259458Z self, 2025-05-07T20:31:45.6259540Z T: int, 2025-05-07T20:31:45.6259627Z D: int, 2025-05-07T20:31:45.6259730Z scale_ub: Optional[float], 2025-05-07T20:31:45.6259824Z contiguous: bool, 2025-05-07T20:31:45.6259917Z compiled: bool, 2025-05-07T20:31:45.6259999Z ) -> None: 2025-05-07T20:31:45.6260098Z torch.manual_seed(2025) 2025-05-07T20:31:45.6260182Z 2025-05-07T20:31:45.6260354Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6262134Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6262222Z 2025-05-07T20:31:45.6262345Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6262350Z 2025-05-07T20:31:45.6262459Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6262682Z self=, 2025-05-07T20:31:45.6262761Z T=128, 2025-05-07T20:31:45.6262848Z D=7168, 2025-05-07T20:31:45.6262934Z scale_ub=1200.0, 2025-05-07T20:31:45.6263027Z contiguous=True, 2025-05-07T20:31:45.6263119Z compiled=True, 2025-05-07T20:31:45.6263195Z ) 2025-05-07T20:31:45.6263411Z self = 2025-05-07T20:31:45.6263583Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:45.6263588Z 2025-05-07T20:31:45.6263668Z @given( 2025-05-07T20:31:45.6263791Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6263891Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6264014Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6264134Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6264247Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6264325Z ) 2025-05-07T20:31:45.6264572Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6264669Z def test_silu_mul_quant( 2025-05-07T20:31:45.6264749Z self, 2025-05-07T20:31:45.6264912Z T: int, 2025-05-07T20:31:45.6264992Z D: int, 2025-05-07T20:31:45.6265098Z scale_ub: Optional[float], 2025-05-07T20:31:45.6265189Z contiguous: bool, 2025-05-07T20:31:45.6265283Z compiled: bool, 2025-05-07T20:31:45.6265367Z ) -> None: 2025-05-07T20:31:45.6265465Z torch.manual_seed(2025) 2025-05-07T20:31:45.6265543Z 2025-05-07T20:31:45.6265717Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6265802Z 2025-05-07T20:31:45.6265896Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6266029Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6266121Z x = x_sign * x_clamp 2025-05-07T20:31:45.6266204Z x0 = x[:, :D] 2025-05-07T20:31:45.6266294Z x1 = x[:, D:] 2025-05-07T20:31:45.6266369Z 2025-05-07T20:31:45.6266461Z if contiguous: 2025-05-07T20:31:45.6266557Z x0 = x0.contiguous() 2025-05-07T20:31:45.6266649Z x1 = x1.contiguous() 2025-05-07T20:31:45.6266733Z 2025-05-07T20:31:45.6266828Z if scale_ub is not None: 2025-05-07T20:31:45.6266935Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.6267078Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.6267160Z ) 2025-05-07T20:31:45.6267239Z else: 2025-05-07T20:31:45.6267344Z scale_ub_tensor = None 2025-05-07T20:31:45.6267420Z 2025-05-07T20:31:45.6267553Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.6267654Z op = silu_mul_quant 2025-05-07T20:31:45.6267744Z if compiled: 2025-05-07T20:31:45.6267852Z op = torch.compile(op) 2025-05-07T20:31:45.6267961Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6268037Z 2025-05-07T20:31:45.6268135Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.6268140Z 2025-05-07T20:31:45.6268241Z moe/activation_test.py:117: 2025-05-07T20:31:45.6268378Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6268487Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.6268587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6268958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.6269056Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.6269549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.6269759Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.6270116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:31:45.6270339Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.6270683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.6270787Z kernel = self.compile( 2025-05-07T20:31:45.6271213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.6271399Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.6271531Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6271536Z 2025-05-07T20:31:45.6271749Z self = 2025-05-07T20:31:45.6272529Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.6273034Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb368860>} 2025-05-07T20:31:45.6273860Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.6274054Z context = 2025-05-07T20:31:45.6274059Z 2025-05-07T20:31:45.6274228Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.6274492Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.6274611Z module_map=module_map) 2025-05-07T20:31:45.6274777Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.6274878Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.6274962Z E ^ 2025-05-07T20:31:45.6275316Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.6275321Z 2025-05-07T20:31:45.6275740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.6275748Z 2025-05-07T20:31:45.6275855Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6276077Z self=, 2025-05-07T20:31:45.6276160Z T=128, 2025-05-07T20:31:45.6276240Z D=7168, 2025-05-07T20:31:45.6276325Z scale_ub=1200.0, 2025-05-07T20:31:45.6276423Z contiguous=True, 2025-05-07T20:31:45.6276510Z compiled=False, 2025-05-07T20:31:45.6276586Z ) 2025-05-07T20:31:45.6276808Z self = 2025-05-07T20:31:45.6276984Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.6276989Z 2025-05-07T20:31:45.6277075Z @given( 2025-05-07T20:31:45.6277196Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6277301Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6277423Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6277542Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6277657Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6277736Z ) 2025-05-07T20:31:45.6277981Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6278080Z def test_silu_mul_quant( 2025-05-07T20:31:45.6278245Z self, 2025-05-07T20:31:45.6278327Z T: int, 2025-05-07T20:31:45.6278407Z D: int, 2025-05-07T20:31:45.6278513Z scale_ub: Optional[float], 2025-05-07T20:31:45.6278605Z contiguous: bool, 2025-05-07T20:31:45.6278699Z compiled: bool, 2025-05-07T20:31:45.6278781Z ) -> None: 2025-05-07T20:31:45.6278879Z torch.manual_seed(2025) 2025-05-07T20:31:45.6278957Z 2025-05-07T20:31:45.6279126Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6279210Z 2025-05-07T20:31:45.6279311Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6279437Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6281208Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
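Two things stand out across these retries: the reported free memory shrinks between examples (30.44 MiB in the earlier failures, 8.44 MiB here) while "allocated by PyTorch" holds near 21.7 GiB, and "reserved but unallocated" stays at a few MiB. That points to live tensors accumulated earlier in the session rather than cache fragmentation, so a torch.cuda.empty_cache() between examples would have little to reclaim. A small instrumentation sketch for watching the same two totals the error message reports (assumed to be added around the failing examples; it is not in the original test):

    import torch

    live = torch.cuda.memory_allocated()      # "allocated by PyTorch" in the message
    reserved = torch.cuda.memory_reserved()   # live tensors plus cached segments
    print(f"{live / 2**30:.2f} GiB live, "
          f"{(reserved - live) / 2**20:.2f} MiB reserved but unallocated")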
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6281222Z 2025-05-07T20:31:45.6281363Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:45.6281368Z 2025-05-07T20:31:45.6281493Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6281798Z self=, 2025-05-07T20:31:45.6281884Z T=128, 2025-05-07T20:31:45.6281967Z D=5120, 2025-05-07T20:31:45.6282055Z scale_ub=1200.0, 2025-05-07T20:31:45.6282147Z contiguous=True, 2025-05-07T20:31:45.6282238Z compiled=True, 2025-05-07T20:31:45.6282317Z ) 2025-05-07T20:31:45.6282533Z self = 2025-05-07T20:31:45.6282703Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:45.6282713Z 2025-05-07T20:31:45.6282794Z @given( 2025-05-07T20:31:45.6282915Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6283021Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6283138Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6283258Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6283372Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6283453Z ) 2025-05-07T20:31:45.6283709Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6283809Z def test_silu_mul_quant( 2025-05-07T20:31:45.6283890Z self, 2025-05-07T20:31:45.6283973Z T: int, 2025-05-07T20:31:45.6284054Z D: int, 2025-05-07T20:31:45.6284156Z scale_ub: Optional[float], 2025-05-07T20:31:45.6284254Z contiguous: bool, 2025-05-07T20:31:45.6284343Z compiled: bool, 2025-05-07T20:31:45.6284430Z ) -> None: 2025-05-07T20:31:45.6284533Z torch.manual_seed(2025) 2025-05-07T20:31:45.6284611Z 2025-05-07T20:31:45.6284786Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6284865Z 2025-05-07T20:31:45.6284962Z > x_sign = torch.sign(x) 2025-05-07T20:31:45.6286730Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6286736Z 2025-05-07T20:31:45.6286855Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:45.6286941Z 2025-05-07T20:31:45.6287051Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6287273Z self=, 2025-05-07T20:31:45.6287355Z T=128, 2025-05-07T20:31:45.6287440Z D=7168, 2025-05-07T20:31:45.6287597Z scale_ub=None, 2025-05-07T20:31:45.6287688Z contiguous=True, 2025-05-07T20:31:45.6287777Z compiled=True, 2025-05-07T20:31:45.6287851Z ) 2025-05-07T20:31:45.6288079Z self = 2025-05-07T20:31:45.6288247Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.6288252Z 2025-05-07T20:31:45.6288333Z @given( 2025-05-07T20:31:45.6288456Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6288557Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6288673Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6288794Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6288914Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6288991Z ) 2025-05-07T20:31:45.6289239Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6289335Z def test_silu_mul_quant( 2025-05-07T20:31:45.6289417Z self, 2025-05-07T20:31:45.6289497Z T: int, 2025-05-07T20:31:45.6289576Z D: int, 2025-05-07T20:31:45.6289680Z scale_ub: Optional[float], 2025-05-07T20:31:45.6289852Z contiguous: bool, 2025-05-07T20:31:45.6289943Z compiled: bool, 2025-05-07T20:31:45.6290027Z ) -> None: 2025-05-07T20:31:45.6290125Z torch.manual_seed(2025) 2025-05-07T20:31:45.6290200Z 2025-05-07T20:31:45.6290370Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6292125Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6292136Z 2025-05-07T20:31:45.6292265Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6292400Z =============================== warnings summary =============================== 2025-05-07T20:31:45.6292713Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:45.6293015Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:45.6293312Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:45.6294199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:31:45.6294428Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:31:45.6294433Z 2025-05-07T20:31:45.6294621Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:31:45.6295893Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:31:45.6296081Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:31:45.6296169Z 2025-05-07T20:31:45.6296382Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:31:45.6296550Z ================== 1 failed, 1 passed, 13 warnings in 21.87s =================== 2025-05-07T20:31:47.3461972Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:31:47.4079862Z 2025-05-07T20:31:47.4080303Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:31:47.4080728Z 2025-05-07T20:31:47.4080740Z 2025-05-07T20:31:47.4101037Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:31:49.5359591Z ============================= test session starts ============================== 2025-05-07T20:31:49.5360254Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:49.5360781Z cachedir: .pytest_cache 2025-05-07T20:31:49.5361354Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:49.5362081Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:49.5362799Z plugins: hypothesis-6.131.14 2025-05-07T20:31:51.1270204Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:51.2789655Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:31:51.2790443Z run-last-failure: rerun previous 1 failure 2025-05-07T20:31:51.2790884Z 2025-05-07T20:31:53.4194697Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:53.4195824Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:31:53.4197170Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:53.4198727Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:53.4199710Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.4201014Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:53.4202397Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.4203381Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.4204600Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:53.4206299Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.4216268Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.4217754Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:53.4219035Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:31:53.4220268Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:53.4221485Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:31:53.4222327Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.4223355Z W0507 
20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:53.4224581Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:31:53.4225392Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:31:53.4226612Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:53.4227906Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:53.4229026Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:53.4230079Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:31:53.4231259Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:53.4232620Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:53.4233824Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.4234764Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:53.4235512Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:31:53.4236542Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:53.4365773Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:53.4366837Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:31:53.4368501Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:53.4369940Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:53.4370920Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.4372230Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:53.4373609Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.4374585Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.4375939Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:53.4377309Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.4378368Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.4379652Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:53.4380882Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:31:53.4382098Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:53.4383297Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:31:53.4384121Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.4385134Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:53.4386149Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:31:53.4386939Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:31:53.4388144Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:53.4389419Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:53.4390604Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:53.4391636Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:31:53.4392818Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:53.4394163Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:53.4395222Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.4396126Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:53.4396861Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:31:53.4397952Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:53.9530227Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:31:53.9530912Z self=, 2025-05-07T20:31:53.9531349Z T=1, 2025-05-07T20:31:53.9531547Z D=5120, 2025-05-07T20:31:53.9531755Z scale_ub=None, 2025-05-07T20:31:53.9531982Z contiguous=True, 2025-05-07T20:31:53.9532210Z compiled=True, 2025-05-07T20:31:53.9532452Z ) 2025-05-07T20:31:53.9532781Z self = 2025-05-07T20:31:53.9533272Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:53.9533544Z 2025-05-07T20:31:53.9533630Z @given( 2025-05-07T20:31:53.9533875Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:53.9534192Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:53.9534509Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:53.9534862Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:53.9535204Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:53.9535502Z ) 2025-05-07T20:31:53.9535862Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:53.9536319Z def test_silu_mul_quant( 2025-05-07T20:31:53.9536578Z self, 2025-05-07T20:31:53.9536795Z T: int, 2025-05-07T20:31:53.9537013Z D: int, 2025-05-07T20:31:53.9537238Z scale_ub: Optional[float], 2025-05-07T20:31:53.9537526Z contiguous: bool, 2025-05-07T20:31:53.9537784Z compiled: bool, 2025-05-07T20:31:53.9538018Z ) -> None: 2025-05-07T20:31:53.9538248Z torch.manual_seed(2025) 2025-05-07T20:31:53.9538507Z 2025-05-07T20:31:53.9538794Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:53.9539152Z 2025-05-07T20:31:53.9539363Z x_sign = torch.sign(x) 2025-05-07T20:31:53.9539666Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:53.9539989Z x = x_sign * x_clamp 2025-05-07T20:31:53.9540243Z x0 = x[:, :D] 2025-05-07T20:31:53.9540471Z x1 = x[:, D:] 2025-05-07T20:31:53.9540686Z 2025-05-07T20:31:53.9540885Z if contiguous: 2025-05-07T20:31:53.9541132Z x0 = x0.contiguous() 2025-05-07T20:31:53.9541399Z x1 = x1.contiguous() 2025-05-07T20:31:53.9541655Z 2025-05-07T20:31:53.9542198Z if scale_ub is not None: 2025-05-07T20:31:53.9542475Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:53.9542823Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:53.9543144Z ) 2025-05-07T20:31:53.9543347Z else: 2025-05-07T20:31:53.9543570Z scale_ub_tensor = None 2025-05-07T20:31:53.9543830Z 2025-05-07T20:31:53.9544064Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:53.9544389Z op = silu_mul_quant 2025-05-07T20:31:53.9544653Z if compiled: 2025-05-07T20:31:53.9544903Z op = torch.compile(op) 2025-05-07T20:31:53.9545208Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:53.9545492Z 2025-05-07T20:31:53.9545695Z y_fp8, y_scale = fn() 2025-05-07T20:31:53.9545995Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:53.9546296Z 2025-05-07T20:31:53.9546542Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:53.9546881Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:53.9547183Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:53.9547509Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:53.9547876Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:53.9548187Z 2025-05-07T20:31:53.9548402Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:53.9548600Z 2025-05-07T20:31:53.9548710Z moe/activation_test.py:126: 2025-05-07T20:31:53.9549154Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.9549501Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:53.9549835Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:53.9550630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:53.9551391Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:53.9551951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:53.9552639Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:53.9553369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:53.9554305Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:53.9555064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:53.9555818Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:53.9556541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:53.9557185Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:53.9557797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:53.9558322Z fn() 2025-05-07T20:31:53.9558826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:53.9559414Z self.fn.run( 2025-05-07T20:31:53.9559888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:53.9560422Z kernel = self.compile( 2025-05-07T20:31:53.9560968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:53.9561619Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.9562018Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.9562401Z 2025-05-07T20:31:53.9562610Z self = 2025-05-07T20:31:53.9563789Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:53.9565180Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a97697ce0>} 2025-05-07T20:31:53.9566536Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:53.9567623Z context = 2025-05-07T20:31:53.9567920Z 2025-05-07T20:31:53.9568088Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:53.9568618Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.9569094Z module_map=module_map) 2025-05-07T20:31:53.9569457Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.9569819Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:53.9570093Z E ^ 2025-05-07T20:31:53.9570556Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:53.9571097Z 2025-05-07T20:31:53.9571516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:53.9572038Z 2025-05-07T20:31:53.9572147Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:53.9572563Z self=, 2025-05-07T20:31:53.9572965Z T=2048, 2025-05-07T20:31:53.9573162Z D=5120, 2025-05-07T20:31:53.9573367Z scale_ub=1200.0, 2025-05-07T20:31:53.9573585Z contiguous=True, 2025-05-07T20:31:53.9573812Z compiled=False, 2025-05-07T20:31:53.9574024Z ) 2025-05-07T20:31:54.4995232Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:54.4997391Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:31:54.5000164Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:54.5003452Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:54.5004490Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.5006059Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:54.5007448Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.5008542Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.5009778Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:54.5011507Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.5012589Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.5013872Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:54.5015126Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:31:54.5016354Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:54.5017561Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:31:54.5018537Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.5019568Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:54.5020589Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:31:54.5021383Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^ 2025-05-07T20:31:54.5022594Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:54.5023930Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:54.5025064Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:54.5026108Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:31:54.5027290Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:54.5028655Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:54.5029722Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.5030640Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.5031389Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:31:54.5032413Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.0528150Z self = 2025-05-07T20:31:55.0528685Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:55.0529003Z 2025-05-07T20:31:55.0529086Z @given( 2025-05-07T20:31:55.0529327Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:55.0529647Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:55.0529961Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:55.0530297Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:55.0530627Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:55.0530936Z ) 2025-05-07T20:31:55.0531291Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:55.0531740Z def test_silu_mul_quant( 2025-05-07T20:31:55.0531983Z self, 2025-05-07T20:31:55.0532188Z T: int, 2025-05-07T20:31:55.0532395Z D: int, 2025-05-07T20:31:55.0532654Z scale_ub: Optional[float], 2025-05-07T20:31:55.0532925Z contiguous: bool, 2025-05-07T20:31:55.0533173Z compiled: bool, 2025-05-07T20:31:55.0533414Z ) -> None: 2025-05-07T20:31:55.0533629Z torch.manual_seed(2025) 2025-05-07T20:31:55.0533883Z 2025-05-07T20:31:55.0534168Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:55.0534523Z 2025-05-07T20:31:55.0534719Z x_sign = torch.sign(x) 2025-05-07T20:31:55.0535021Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:55.0535337Z x = x_sign * x_clamp 2025-05-07T20:31:55.0535583Z x0 = x[:, :D] 2025-05-07T20:31:55.0535804Z x1 = x[:, D:] 2025-05-07T20:31:55.0536019Z 2025-05-07T20:31:55.0536207Z if contiguous: 2025-05-07T20:31:55.0536449Z x0 = x0.contiguous() 2025-05-07T20:31:55.0536710Z x1 = x1.contiguous() 2025-05-07T20:31:55.0536947Z 2025-05-07T20:31:55.0537145Z if scale_ub is not None: 2025-05-07T20:31:55.0537421Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:55.0537753Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:55.0538381Z ) 2025-05-07T20:31:55.0538580Z else: 2025-05-07T20:31:55.0538791Z scale_ub_tensor = None 2025-05-07T20:31:55.0539050Z 2025-05-07T20:31:55.0539289Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:55.0539613Z op = silu_mul_quant 2025-05-07T20:31:55.0539863Z if compiled: 2025-05-07T20:31:55.0540114Z op = torch.compile(op) 2025-05-07T20:31:55.0540419Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:55.0540693Z 2025-05-07T20:31:55.0540897Z > y_fp8, y_scale = fn() 2025-05-07T20:31:55.0541063Z 2025-05-07T20:31:55.0541177Z moe/activation_test.py:117: 2025-05-07T20:31:55.0541475Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.0541813Z moe/activation_test.py:115: in fn 2025-05-07T20:31:55.0542098Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:55.0542790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:55.0543501Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:55.0544091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:55.0544779Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:55.0545577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:55.0546121Z kernel = self.compile( 2025-05-07T20:31:55.0546672Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:55.0547335Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.0547732Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.0547978Z 2025-05-07T20:31:55.0548187Z self = 2025-05-07T20:31:55.0549276Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:55.0550686Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a97911da0>} 2025-05-07T20:31:55.0552037Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:55.0553067Z context = 2025-05-07T20:31:55.0553364Z 2025-05-07T20:31:55.0553531Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:55.0554065Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.0554530Z module_map=module_map) 2025-05-07T20:31:55.0554903Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.0555261Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:55.0555521Z E ^ 2025-05-07T20:31:55.0556000Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.0556463Z 2025-05-07T20:31:55.0556888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:55.0557403Z 2025-05-07T20:31:55.0557514Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:55.0557926Z self=, 2025-05-07T20:31:55.0558343Z T=2048, 2025-05-07T20:31:55.0558629Z D=5120, 2025-05-07T20:31:55.0558827Z scale_ub=1200.0, 2025-05-07T20:31:55.0559059Z contiguous=True, 2025-05-07T20:31:55.0559288Z compiled=True, 2025-05-07T20:31:55.0559504Z ) 2025-05-07T20:31:55.0559827Z self = 2025-05-07T20:31:55.0560328Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:55.0560602Z 2025-05-07T20:31:55.0560693Z @given( 2025-05-07T20:31:55.0560929Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:55.0561249Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:55.0561564Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:55.0561894Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:55.0562229Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:55.0562525Z ) 2025-05-07T20:31:55.0562881Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:55.0563330Z def test_silu_mul_quant( 2025-05-07T20:31:55.0563583Z self, 2025-05-07T20:31:55.0563787Z T: int, 2025-05-07T20:31:55.0563989Z D: int, 2025-05-07T20:31:55.0564216Z scale_ub: Optional[float], 2025-05-07T20:31:55.0564494Z contiguous: bool, 2025-05-07T20:31:55.0564734Z compiled: bool, 2025-05-07T20:31:55.0564970Z ) -> None: 2025-05-07T20:31:55.0565197Z torch.manual_seed(2025) 2025-05-07T20:31:55.0565439Z 2025-05-07T20:31:55.0565801Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:55.0566158Z 2025-05-07T20:31:55.0566353Z x_sign = torch.sign(x) 2025-05-07T20:31:55.0566654Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:55.0566974Z x = x_sign * x_clamp 2025-05-07T20:31:55.0567214Z x0 = x[:, :D] 
2025-05-07T20:31:55.0567440Z x1 = x[:, D:] 2025-05-07T20:31:55.0567738Z 2025-05-07T20:31:55.0567926Z if contiguous: 2025-05-07T20:31:55.0568177Z x0 = x0.contiguous() 2025-05-07T20:31:55.0568445Z x1 = x1.contiguous() 2025-05-07T20:31:55.0568699Z 2025-05-07T20:31:55.0568895Z if scale_ub is not None: 2025-05-07T20:31:55.0569171Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:55.0569511Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:55.0569825Z ) 2025-05-07T20:31:55.0570025Z else: 2025-05-07T20:31:55.0570246Z scale_ub_tensor = None 2025-05-07T20:31:55.0570503Z 2025-05-07T20:31:55.0570740Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:55.0571057Z op = silu_mul_quant 2025-05-07T20:31:55.0571306Z if compiled: 2025-05-07T20:31:55.0571558Z op = torch.compile(op) 2025-05-07T20:31:55.0571860Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:55.0572135Z 2025-05-07T20:31:55.0572337Z y_fp8, y_scale = fn() 2025-05-07T20:31:55.0572632Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:55.0572924Z 2025-05-07T20:31:55.0573167Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:55.0573510Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:55.0573852Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:55.0574168Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:55.0574529Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:55.0574848Z 2025-05-07T20:31:55.0575050Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:55.0575250Z 2025-05-07T20:31:55.0575349Z moe/activation_test.py:126: 2025-05-07T20:31:55.0575651Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.0575986Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:55.0576314Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:55.0577104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:55.0577953Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:55.0578498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:55.0579186Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:55.0579881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:55.0580615Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:55.0589277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:55.0590256Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:55.0591186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:55.0591977Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:55.0592711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:55.0593347Z fn() 2025-05-07T20:31:55.0594022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:55.0594845Z self.fn.run( 2025-05-07T20:31:55.0595332Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:55.0595878Z kernel = self.compile( 2025-05-07T20:31:55.0596433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:55.0597091Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.0597508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.0597744Z 2025-05-07T20:31:55.0597965Z self = 2025-05-07T20:31:55.0599058Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:55.0600443Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a9642f2e0>} 2025-05-07T20:31:55.0601800Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:55.0602835Z context = 2025-05-07T20:31:55.0603132Z 2025-05-07T20:31:55.0603311Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:55.0603840Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.0604315Z module_map=module_map) 2025-05-07T20:31:55.0604698Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.0605067Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:55.0605342Z E ^ 2025-05-07T20:31:55.0606153Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.0606608Z 2025-05-07T20:31:55.0607037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:55.0607604Z 2025-05-07T20:31:55.0607714Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:55.0608138Z self=, 2025-05-07T20:31:55.0608716Z T=16384, 2025-05-07T20:31:55.0608925Z D=7168, 2025-05-07T20:31:55.0609128Z scale_ub=1200.0, 2025-05-07T20:31:55.0609370Z contiguous=False, 2025-05-07T20:31:55.0609607Z compiled=False, 2025-05-07T20:31:55.0609819Z ) 2025-05-07T20:31:55.3726105Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:55.3728454Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:31:55.3731159Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:55.3733873Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:55.3734892Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.3736563Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 
2025-05-07T20:31:55.3737957Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.3738950Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.3740184Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:55.3741564Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.3742643Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.3743928Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:55.3745180Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 2025-05-07T20:31:55.3746391Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:55.3747602Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:31:55.3748436Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.3749466Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:55.3750491Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:31:55.3751434Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:55.3752649Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:55.3753948Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:55.3755071Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:55.3756113Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:31:55.3757289Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:55.3758651Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:55.3759799Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.3760717Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:55.3761459Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^ 2025-05-07T20:31:55.3762479Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.4482286Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:55.4483556Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:31:55.4485272Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:55.4486717Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:55.4487810Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.4489128Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:55.4490519Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.4491514Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.4492753Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:55.4494509Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.4495590Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.4496870Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:55.4498123Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 2025-05-07T20:31:55.4499362Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:55.4500573Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:31:55.4501541Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.4502573Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:55.4503595Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:31:55.4504396Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:55.4505888Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:55.4507188Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:55.4508311Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:55.4509352Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:31:55.4510530Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:55.4511892Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:55.4512955Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.4513892Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:55.4514663Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^ 2025-05-07T20:31:55.4515684Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:56.1322791Z self = 2025-05-07T20:31:56.1323359Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:56.1323644Z 2025-05-07T20:31:56.1323730Z @given( 2025-05-07T20:31:56.1323973Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:56.1324296Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:56.1324639Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:56.1324984Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:56.1325317Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:56.1325617Z ) 2025-05-07T20:31:56.1325970Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:56.1326421Z def test_silu_mul_quant( 2025-05-07T20:31:56.1326668Z self, 2025-05-07T20:31:56.1326865Z T: int, 2025-05-07T20:31:56.1327085Z D: int, 2025-05-07T20:31:56.1327310Z scale_ub: Optional[float], 2025-05-07T20:31:56.1327717Z contiguous: bool, 2025-05-07T20:31:56.1327966Z compiled: bool, 2025-05-07T20:31:56.1328194Z ) -> None: 2025-05-07T20:31:56.1328419Z torch.manual_seed(2025) 2025-05-07T20:31:56.1328673Z 2025-05-07T20:31:56.1328944Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:56.1329295Z 2025-05-07T20:31:56.1329497Z x_sign = torch.sign(x) 2025-05-07T20:31:56.1330116Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:56.1330445Z x = x_sign * x_clamp 2025-05-07T20:31:56.1330695Z x0 = x[:, :D] 2025-05-07T20:31:56.1330914Z x1 = x[:, D:] 2025-05-07T20:31:56.1331131Z 2025-05-07T20:31:56.1331323Z if contiguous: 2025-05-07T20:31:56.1331555Z x0 = x0.contiguous() 2025-05-07T20:31:56.1331819Z x1 = x1.contiguous() 2025-05-07T20:31:56.1332062Z 2025-05-07T20:31:56.1332261Z if scale_ub is not None: 2025-05-07T20:31:56.1332538Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:56.1332883Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:56.1333198Z ) 2025-05-07T20:31:56.1333394Z else: 2025-05-07T20:31:56.1333614Z scale_ub_tensor = None 2025-05-07T20:31:56.1333872Z 2025-05-07T20:31:56.1334106Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:56.1334432Z op = silu_mul_quant 2025-05-07T20:31:56.1334686Z if compiled: 2025-05-07T20:31:56.1334934Z op = torch.compile(op) 2025-05-07T20:31:56.1335236Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.1335519Z 2025-05-07T20:31:56.1335711Z > y_fp8, y_scale = fn() 2025-05-07T20:31:56.1335882Z 2025-05-07T20:31:56.1335984Z moe/activation_test.py:117: 2025-05-07T20:31:56.1336286Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.1336626Z moe/activation_test.py:115: in fn 2025-05-07T20:31:56.1336912Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.1337609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:56.1338311Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:56.1338846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:56.1339541Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:56.1340211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:56.1340751Z kernel = self.compile( 2025-05-07T20:31:56.1341297Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:56.1341960Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.1342543Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.1342775Z 2025-05-07T20:31:56.1342985Z self = 2025-05-07T20:31:56.1344080Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:56.1345479Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a96238860>} 2025-05-07T20:31:56.1346827Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:56.1347859Z context = 2025-05-07T20:31:56.1348148Z 2025-05-07T20:31:56.1348315Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:56.1348846Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.1349319Z module_map=module_map) 2025-05-07T20:31:56.1349689Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.1350122Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:56.1350399Z E ^ 2025-05-07T20:31:56.1350869Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:56.1351320Z 2025-05-07T20:31:56.1351740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:56.1352261Z 2025-05-07T20:31:56.1352369Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:56.1352795Z self=, 2025-05-07T20:31:56.1353206Z T=1, 2025-05-07T20:31:56.1353397Z D=7168, 2025-05-07T20:31:56.1353605Z scale_ub=None, 2025-05-07T20:31:56.1353827Z contiguous=True, 2025-05-07T20:31:56.1354050Z compiled=True, 2025-05-07T20:31:56.1354266Z ) 2025-05-07T20:31:56.1354594Z self = 2025-05-07T20:31:56.1355099Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:56.1355362Z 2025-05-07T20:31:56.1355443Z @given( 2025-05-07T20:31:56.1355686Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:56.1356008Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:56.1356312Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:56.1356647Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:56.1356982Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:56.1357283Z ) 2025-05-07T20:31:56.1357631Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:56.1358080Z def test_silu_mul_quant( 2025-05-07T20:31:56.1358331Z self, 2025-05-07T20:31:56.1358528Z T: int, 2025-05-07T20:31:56.1358736Z D: int, 2025-05-07T20:31:56.1358961Z scale_ub: Optional[float], 2025-05-07T20:31:56.1359233Z contiguous: bool, 2025-05-07T20:31:56.1359481Z compiled: bool, 2025-05-07T20:31:56.1359719Z ) -> None: 2025-05-07T20:31:56.1359936Z torch.manual_seed(2025) 2025-05-07T20:31:56.1360185Z 2025-05-07T20:31:56.1360464Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:56.1360808Z 2025-05-07T20:31:56.1361009Z x_sign = torch.sign(x) 2025-05-07T20:31:56.1361304Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:56.1361611Z x = x_sign * x_clamp 2025-05-07T20:31:56.1361949Z x0 = x[:, :D] 2025-05-07T20:31:56.1362169Z x1 = 
x[:, D:] 2025-05-07T20:31:56.1362378Z 2025-05-07T20:31:56.1362572Z if contiguous: 2025-05-07T20:31:56.1362813Z x0 = x0.contiguous() 2025-05-07T20:31:56.1363080Z x1 = x1.contiguous() 2025-05-07T20:31:56.1363320Z 2025-05-07T20:31:56.1363518Z if scale_ub is not None: 2025-05-07T20:31:56.1363795Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:56.1364138Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:56.1364456Z ) 2025-05-07T20:31:56.1364661Z else: 2025-05-07T20:31:56.1364873Z scale_ub_tensor = None 2025-05-07T20:31:56.1365132Z 2025-05-07T20:31:56.1365369Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:56.1365682Z op = silu_mul_quant 2025-05-07T20:31:56.1365938Z if compiled: 2025-05-07T20:31:56.1366192Z op = torch.compile(op) 2025-05-07T20:31:56.1366493Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.1366780Z 2025-05-07T20:31:56.1366983Z y_fp8, y_scale = fn() 2025-05-07T20:31:56.1367271Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:56.1367639Z 2025-05-07T20:31:56.1367894Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:56.1368239Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:56.1368533Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:56.1368934Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:56.1369302Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:56.1369617Z 2025-05-07T20:31:56.1369827Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:56.1370025Z 2025-05-07T20:31:56.1370134Z moe/activation_test.py:126: 2025-05-07T20:31:56.1370430Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.1370771Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:56.1371108Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:56.1371897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:56.1372654Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:56.1373210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:56.1373937Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:56.1374655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:56.1375376Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:56.1376139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:56.1376896Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:56.1377623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:56.1378268Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:56.1378876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:56.1379402Z fn() 2025-05-07T20:31:56.1379909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:56.1380497Z self.fn.run( 2025-05-07T20:31:56.1380972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 
2025-05-07T20:31:56.1381503Z kernel = self.compile( 2025-05-07T20:31:56.1382048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:56.1382792Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.1383195Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.1383425Z 2025-05-07T20:31:56.1383633Z self = 2025-05-07T20:31:56.1384719Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:56.1386094Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a96239080>} 2025-05-07T20:31:56.1387439Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:56.1388472Z context = 2025-05-07T20:31:56.1388763Z 2025-05-07T20:31:56.1388928Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:56.1389454Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.1389925Z module_map=module_map) 2025-05-07T20:31:56.1390366Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.1390734Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:56.1391012Z E ^ 2025-05-07T20:31:56.1391481Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:56.1391932Z 2025-05-07T20:31:56.1392349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:56.1392874Z 2025-05-07T20:31:56.1392982Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:56.1393400Z self=, 2025-05-07T20:31:56.1393807Z T=4096, 2025-05-07T20:31:56.1394000Z D=5120, 2025-05-07T20:31:56.1394197Z scale_ub=None, 2025-05-07T20:31:56.1394422Z contiguous=False, 2025-05-07T20:31:56.1394648Z compiled=False, 2025-05-07T20:31:56.1394860Z ) 2025-05-07T20:31:56.5055054Z W0507 20:31:56.503000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:56.5056129Z W0507 20:31:56.503000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): 2025-05-07T20:31:56.5057468Z W0507 20:31:56.503000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:56.5058894Z W0507 20:31:56.503000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:56.5059867Z W0507 20:31:56.503000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:56.5061180Z W0507 20:31:56.503000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:56.5062562Z W0507 20:31:56.503000 87443 
2025-05-07T20:31:56.5055054Z W0507 20:31:56.503000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated (CompilationError at 1:0 in _fbgemm_silu_mul_quant: type fp8e4nv not supported in this architecture; traceback identical to the one above)
2025-05-07T20:31:56.7730244Z W0507 20:31:56.770000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] (same warning repeated with an identical traceback)
2025-05-07T20:31:57.4596407Z self = <...>
T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False
    (test source identical to the listing above)
>   y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:31:57.4633516Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
    -> same failure in fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant (CompilationError: fp8e4nv not supported)
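For orientation while reading these failures: ref_fn in the listing above computes SiLU(x0) * x1 in fp32 and then quantizes rowwise to FP8. Below is a minimal pure-PyTorch sketch of that computation, assuming triton_quantize_fp8_row returns a per-row dequantization scale (the test reconstructs y as y_fp8.to(torch.float32) * y_scale[:, None]) and that scale_ub caps the per-row max before the scale is formed; the 448.0 constant (finite max of float8_e4m3fn) and all names here are assumptions, not FBGEMM's actual implementation:

    import torch

    FP8_MAX_E4M3 = 448.0  # finite max of torch.float8_e4m3fn (assumed target format)

    def rowwise_fp8_quant_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        # Per-row absolute max, optionally capped by scale_ub (assumed semantics).
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / FP8_MAX_E4M3    # per-row dequant scale
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        x0, x1 = x0.to(torch.float32), x1.to(torch.float32)
        y = x0 * torch.sigmoid(x0) * x1                    # SiLU(x0) * x1, as in ref_fn
        return rowwise_fp8_quant_ref(y, scale_ub)

Dequantizing with y_fp8.to(torch.float32) * scale[:, None] should approximately recover y, which is the identity the test relies on.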
2025-05-07T20:31:57.4665233Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
    -> fails in ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row (CompilationError: fp8e4nv not supported)
2025-05-07T20:31:57.5143275Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
    -> fails in fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant (CompilationError: fp8e4nv not supported)
2025-05-07T20:31:57.8214880Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    -> fails in fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant (CompilationError: fp8e4nv not supported)
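The ValueError itself names the only fp8 types Triton will accept on this GPU: 'fp8e4b15' and 'fp8e5'. A hedged sketch of a capability-based dtype choice follows; this is an illustrative policy only, not FBGEMM's actual dispatch, assuming torch.float8_e5m2 as the PyTorch counterpart of Triton's fp8e5:

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # Prefer e4m3 where Triton supports fp8e4nv (SM >= 8.9); otherwise
        # fall back to e5m2, one of the types the error lists as supported.
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2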
2025-05-07T20:31:57.8252487Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:31:58.1603297Z W0507 20:31:58.157000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated (_fbgemm_silu_mul_quant: fp8e4nv not supported; traceback identical to the [0/3] warnings above)
2025-05-07T20:31:58.2457883Z W0507 20:31:58.243000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] (same warning repeated with an identical traceback)
2025-05-07T20:31:58.5460392Z self = <...>
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
    (test source identical to the first listing)
    -> fails in ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row (CompilationError: fp8e4nv not supported)
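About the W0507 lines interleaved above: during torch.compile tracing, torch._higher_order_ops.triton_kernel_wrap tries to lower the user's Triton kernel to TTIR in order to determine which tensor arguments the kernel mutates; when that lowering fails (here, for the same fp8e4nv reason), it falls back to assuming every input is mutated, exactly as the warning text says. The warnings are a side effect of the underlying compile failure, not an independent bug. A toy sketch of that conservative fallback, illustrative only and not PyTorch's actual code:

    def mutated_args(kernel_args, ttir_analysis=None):
        # If TTIR-based mutation analysis is unavailable (it raised), be
        # conservative and report every argument as potentially mutated.
        if ttir_analysis is None:
            return set(kernel_args)
        return {a for a in kernel_args if ttir_analysis.get(a, False)}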
2025-05-07T20:31:58.5497957Z 
2025-05-07T20:31:58.5498375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:58.5498904Z 
2025-05-07T20:31:58.5499010Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:58.5499430Z     self=,
2025-05-07T20:31:58.5499832Z     T=2048,
2025-05-07T20:31:58.5500027Z     D=5120,
2025-05-07T20:31:58.5500227Z     scale_ub=None,
2025-05-07T20:31:58.5500443Z     contiguous=True,
2025-05-07T20:31:58.5500670Z     compiled=True,
2025-05-07T20:31:58.5500878Z )
2025-05-07T20:31:58.8705455Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:58.8708575Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last):
2025-05-07T20:31:58.8711266Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:58.8714451Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:58.8715626Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:58.8716931Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:58.8718308Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:58.8719289Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:58.8720508Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:58.8721991Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:58.8723052Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:58.8724329Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:58.8725562Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]     generator.visit(fn.parse())
2025-05-07T20:31:58.8726783Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:58.8728099Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]     ret = super().visit(node)
2025-05-07T20:31:58.8728925Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]           ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:58.8729949Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit
2025-05-07T20:31:58.8730961Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]     return visitor(node)
2025-05-07T20:31:58.8731751Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]            ^^^^^^^^^^^^^
2025-05-07T20:31:58.8732970Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:58.8734248Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:58.8735357Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit
2025-05-07T20:31:58.8736481Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]     self.visit(item)
2025-05-07T20:31:58.8737655Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:58.8739005Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:58.8740071Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:58.8740971Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:58.8741710Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^
2025-05-07T20:31:58.8742732Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[... the same identify_mutated_tensors warning is emitted a second time, followed by the test source and _kernel_quantize_fp8_row traceback shown above, now for T=2048 ...]
2025-05-07T20:31:59.2620802Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:59.2621166Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:59.2621434Z E       ^
2025-05-07T20:31:59.2621917Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:59.2622791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
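The contract of the failing reference path, triton_quantize_fp8_row, is pinned down by the test's dequant step (y = y_fp8.to(torch.float32) * y_scale[:, None]): it returns row-wise FP8 values plus a per-row dequantization scale. Below is a rough pure-PyTorch sketch of that contract; the clamping epsilon and the exact handling of scale_ub are assumptions, not the kernel's verified semantics. Note that the FP8 cast itself works on SM 8.6 in PyTorch; only Triton's fp8e4nv compute path is unsupported.

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Choose a per-row scale so each row's max |value| maps onto the
        # FP8 E4M3 maximum (448.0); y ~= y_fp8.float() * y_scale[:, None].
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
        y_scale = row_max.clamp(min=1e-12) / fp8_max    # dequantization scale
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale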
2025-05-07T20:31:59.2623414Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:59.2623824Z     self=,
2025-05-07T20:31:59.2624239Z     T=128,
2025-05-07T20:31:59.2624436Z     D=5120,
2025-05-07T20:31:59.2624639Z     scale_ub=None,
2025-05-07T20:31:59.2624864Z     contiguous=True,
2025-05-07T20:31:59.2625096Z     compiled=True,
2025-05-07T20:31:59.2625306Z )
[... same identify_mutated_tensors warnings, test source, and traceback as above, repeated for T=128 ...]
2025-05-07T20:32:00.2140400Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:00.2140762Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:00.2141037Z E       ^
2025-05-07T20:32:00.2141507Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:00.2142393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
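Because Hypothesis draws these examples in sequence, reproducing one failure locally does not require replaying the whole search. One option, sketched below, is to pin a known-failing combination with an explicit @example so it always runs first; max_examples is shown with a literal value since _MAX_SAMPLES is defined elsewhere in the test module, and self is omitted in this standalone sketch.

    from typing import Optional

    from hypothesis import Verbosity, example, given, settings, strategies as st

    @example(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
    def test_silu_mul_quant(
        T: int, D: int, scale_ub: Optional[float], contiguous: bool, compiled: bool
    ) -> None:
        ...  # body as in the log above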
2025-05-07T20:32:00.2143020Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:00.2143440Z     self=,
2025-05-07T20:32:00.2143855Z     T=4096,
2025-05-07T20:32:00.2144048Z     D=5120,
2025-05-07T20:32:00.2144250Z     scale_ub=None,
2025-05-07T20:32:00.2144475Z     contiguous=True,
2025-05-07T20:32:00.2144695Z     compiled=True,
2025-05-07T20:32:00.2144909Z )
[... same identify_mutated_tensors warnings, test source, and traceback as above, repeated for T=4096 ...]
2025-05-07T20:32:01.0066364Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:01.0066729Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:01.0067001Z E       ^
2025-05-07T20:32:01.0067467Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:01.0068338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
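The repeated identify_mutated_tensors warnings are torch.compile reacting to the same compile failure: when TTIR generation for a user-defined Triton kernel raises, mutation analysis is abandoned and every input is conservatively treated as mutated, which is always sound but blocks optimizations. Reduced to a self-contained sketch of the pattern (the function names are illustrative, not torch internals):

    from typing import Callable, Iterable, List

    def mutated_or_all(
        analyze: Callable[[], List[str]], inputs: Iterable[str]
    ) -> List[str]:
        # Precise mutation analysis when it succeeds; otherwise assume every
        # input is mutated, trading optimization for guaranteed correctness.
        try:
            return analyze()
        except Exception:
            return list(inputs)

    def failing_analysis() -> List[str]:
        # Stand-in for generate_ttir() raising CompilationError, as above.
        raise RuntimeError("TTIR generation failed")

    print(mutated_or_all(failing_analysis, ["x0", "x1", "scale_ub"]))
    # -> ['x0', 'x1', 'scale_ub']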
2025-05-07T20:32:01.0068966Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:01.0069379Z     self=,
2025-05-07T20:32:01.0069782Z     T=16384,
2025-05-07T20:32:01.0069982Z     D=5120,
2025-05-07T20:32:01.0070190Z     scale_ub=None,
2025-05-07T20:32:01.0070399Z     contiguous=True,
2025-05-07T20:32:01.0070640Z     compiled=True,
2025-05-07T20:32:01.0070852Z )
2025-05-07T20:32:01.0322469Z W0507 20:32:01.031000 87443 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:01.0323733Z W0507 20:32:01.031000 87443 site-packages/torch/_dynamo/convert_frame.py:987] [0/8]    function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:01.0325077Z W0507 20:32:01.031000 87443 site-packages/torch/_dynamo/convert_frame.py:987] [0/8]    last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:01.0326074Z W0507 20:32:01.031000 87443 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:01.0327181Z W0507 20:32:01.031000 87443 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
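This recompile-limit warning is separate from the FP8 failure. x0 = x[:, :D] is a view whose row stride stays 2 * D = 10240, while x0.contiguous() copies into row stride D = 5120; torch.compile guards on strides, so alternating layouts and T values exhaust the default limit of 8 recompiles, after which dynamo falls back to eager for later examples. A small sketch of the stride difference and one possible mitigation:

    import torch
    import torch._dynamo

    D = 5120
    x = torch.randn(4, 2 * D)
    x0_view = x[:, :D]              # shares storage; stride (2*D, 1) == (10240, 1)
    x0_copy = x0_view.contiguous()  # fresh storage;  stride (D, 1)  == (5120, 1)
    assert x0_view.stride() == (10240, 1)
    assert x0_copy.stride() == (5120, 1)

    # One mitigation sketch: mark the batch dimension dynamic so new values
    # of T do not add guards; raising torch._dynamo.config.recompile_limit
    # is the blunter alternative.
    torch._dynamo.mark_dynamic(x0_copy, 0)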
2025-05-07T20:32:01.1009570Z self = <...>
2025-05-07T20:32:01.1010532Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:01.1010821Z
2025-05-07T20:32:01.1010903Z     @given(
2025-05-07T20:32:01.1011140Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:01.1011456Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:01.1011764Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:01.1012101Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:01.1012449Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:01.1012739Z     )
2025-05-07T20:32:01.1013093Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:01.1013539Z     def test_silu_mul_quant(
2025-05-07T20:32:01.1013786Z         self,
2025-05-07T20:32:01.1013988Z         T: int,
2025-05-07T20:32:01.1014191Z         D: int,
2025-05-07T20:32:01.1014407Z         scale_ub: Optional[float],
2025-05-07T20:32:01.1014696Z         contiguous: bool,
2025-05-07T20:32:01.1014943Z         compiled: bool,
2025-05-07T20:32:01.1015173Z     ) -> None:
2025-05-07T20:32:01.1015396Z         torch.manual_seed(2025)
2025-05-07T20:32:01.1015645Z
2025-05-07T20:32:01.1015919Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:01.1016271Z
2025-05-07T20:32:01.1016482Z         x_sign = torch.sign(x)
2025-05-07T20:32:01.1016779Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:01.1017232Z         x = x_sign * x_clamp
2025-05-07T20:32:01.1017483Z         x0 = x[:, :D]
2025-05-07T20:32:01.1017706Z         x1 = x[:, D:]
2025-05-07T20:32:01.1017916Z
2025-05-07T20:32:01.1018115Z         if contiguous:
2025-05-07T20:32:01.1018354Z             x0 = x0.contiguous()
2025-05-07T20:32:01.1018607Z             x1 = x1.contiguous()
2025-05-07T20:32:01.1018851Z
2025-05-07T20:32:01.1019054Z         if scale_ub is not None:
2025-05-07T20:32:01.1019334Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:01.1019675Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:01.1019987Z             )
2025-05-07T20:32:01.1020178Z         else:
2025-05-07T20:32:01.1020394Z             scale_ub_tensor = None
2025-05-07T20:32:01.1020659Z
2025-05-07T20:32:01.1020893Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:01.1021225Z             op = silu_mul_quant
2025-05-07T20:32:01.1021479Z             if compiled:
2025-05-07T20:32:01.1029755Z                 op = torch.compile(op)
2025-05-07T20:32:01.1030119Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:01.1030417Z
2025-05-07T20:32:01.1030630Z         y_fp8, y_scale = fn()
2025-05-07T20:32:01.1030925Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:01.1031232Z
2025-05-07T20:32:01.1031486Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:01.1031839Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:01.1032146Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:01.1032472Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:01.1032845Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:01.1033161Z
2025-05-07T20:32:01.1033378Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:01.1033585Z
2025-05-07T20:32:01.1033698Z moe/activation_test.py:126:
2025-05-07T20:32:01.1034011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:01.1034371Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:01.1034715Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:01.1035532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:01.1036304Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:01.1054546Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:01.1054911Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:01.1055190Z E   ^
2025-05-07T20:32:01.1055656Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:01.1056540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
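For reference, the test dequantizes with y = y_fp8.to(torch.float32) * y_scale[:, None], so triton_quantize_fp8_row returns a per-row multiplicative scale. Below is a minimal eager sketch of row-wise FP8 quantization consistent with that contract; the function name, the epsilon clamp, and treating scale_ub as a cap on the per-row max are assumptions here, not FBGEMM's actual kernel. Note that an eager cast to torch.float8_e4m3fn works even on the A10G; the architecture limit only bites when Triton compiles fp8 kernels.

from typing import Optional, Tuple
import torch

def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    fp8_max = torch.finfo(torch.float8_e4m3fn).max     # 448.0 for E4M3
    row_max = y.abs().amax(dim=1).to(torch.float32)    # per-row max magnitude
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)     # assumed cap semantics
    y_scale = (row_max / fp8_max).clamp(min=1e-12)     # multiply to dequantize
    y_fp8 = (y.to(torch.float32) / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale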
2025-05-07T20:32:01.1057167Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:01.1057585Z     self=<...>,
2025-05-07T20:32:01.1057987Z     T=1,
2025-05-07T20:32:01.1058184Z     D=5120,
2025-05-07T20:32:01.1058389Z     scale_ub=1200.0,
2025-05-07T20:32:01.1058625Z     contiguous=True,
2025-05-07T20:32:01.1058979Z     compiled=True,
2025-05-07T20:32:01.1059204Z )
2025-05-07T20:32:01.2144707Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:01.2144980Z moe/activation_test.py:117:
2025-05-07T20:32:01.2145653Z moe/activation_test.py:115: in fn
2025-05-07T20:32:01.2145957Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:01.2146510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:01.2147076Z     return fn(*args, **kwargs)
2025-05-07T20:32:01.2147739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:01.2148433Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:01.2159848Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:01.2160215Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:01.2160487Z E   ^
2025-05-07T20:32:01.2160960Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:01.2161833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
torch.tensor( 2025-05-07T20:32:01.2141874Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.2142194Z ) 2025-05-07T20:32:01.2142399Z else: 2025-05-07T20:32:01.2142610Z scale_ub_tensor = None 2025-05-07T20:32:01.2142868Z 2025-05-07T20:32:01.2143112Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.2143429Z op = silu_mul_quant 2025-05-07T20:32:01.2143682Z if compiled: 2025-05-07T20:32:01.2143933Z op = torch.compile(op) 2025-05-07T20:32:01.2144233Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2144506Z 2025-05-07T20:32:01.2144707Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.2144872Z 2025-05-07T20:32:01.2144980Z moe/activation_test.py:117: 2025-05-07T20:32:01.2145274Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.2145653Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.2145957Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2146510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.2147076Z return fn(*args, **kwargs) 2025-05-07T20:32:01.2147739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.2148433Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.2148964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.2149648Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.2150310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.2150992Z kernel = self.compile( 2025-05-07T20:32:01.2151537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.2152192Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.2152593Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.2152821Z 2025-05-07T20:32:01.2153028Z self = 2025-05-07T20:32:01.2154110Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.2155503Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a8455df80>} 2025-05-07T20:32:01.2156905Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.2157935Z context = 2025-05-07T20:32:01.2158223Z 2025-05-07T20:32:01.2158390Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.2158999Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.2159482Z module_map=module_map) 2025-05-07T20:32:01.2159848Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.2160215Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.2160487Z E ^ 2025-05-07T20:32:01.2160960Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.2161411Z 2025-05-07T20:32:01.2161833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.2162351Z 2025-05-07T20:32:01.2162460Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2162881Z self=, 2025-05-07T20:32:01.2163295Z T=1, 2025-05-07T20:32:01.2163486Z D=5120, 2025-05-07T20:32:01.2163692Z scale_ub=None, 2025-05-07T20:32:01.2163920Z contiguous=False, 2025-05-07T20:32:01.2164154Z compiled=True, 2025-05-07T20:32:01.2164378Z ) 2025-05-07T20:32:01.4326785Z self = 2025-05-07T20:32:01.4327629Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:01.4327984Z 2025-05-07T20:32:01.4328085Z @given( 2025-05-07T20:32:01.4328325Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.4328650Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.4328989Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.4329321Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.4329656Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.4329955Z ) 2025-05-07T20:32:01.4330308Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.4330757Z def test_silu_mul_quant( 2025-05-07T20:32:01.4331006Z self, 2025-05-07T20:32:01.4331214Z T: int, 2025-05-07T20:32:01.4331433Z D: int, 2025-05-07T20:32:01.4331661Z scale_ub: Optional[float], 2025-05-07T20:32:01.4331940Z contiguous: bool, 2025-05-07T20:32:01.4332178Z compiled: bool, 2025-05-07T20:32:01.4332416Z ) -> None: 2025-05-07T20:32:01.4332640Z torch.manual_seed(2025) 2025-05-07T20:32:01.4332882Z 2025-05-07T20:32:01.4333163Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.4333852Z 2025-05-07T20:32:01.4334049Z x_sign = torch.sign(x) 2025-05-07T20:32:01.4334344Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.4334658Z x = x_sign * x_clamp 2025-05-07T20:32:01.4334899Z x0 = x[:, :D] 2025-05-07T20:32:01.4335124Z x1 = x[:, D:] 2025-05-07T20:32:01.4335341Z 2025-05-07T20:32:01.4335528Z if contiguous: 2025-05-07T20:32:01.4335769Z x0 = x0.contiguous() 2025-05-07T20:32:01.4336033Z x1 = x1.contiguous() 2025-05-07T20:32:01.4336282Z 2025-05-07T20:32:01.4336512Z if scale_ub is not None: 2025-05-07T20:32:01.4336813Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.4337149Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.4337465Z ) 2025-05-07T20:32:01.4337667Z else: 2025-05-07T20:32:01.4337883Z scale_ub_tensor = None 2025-05-07T20:32:01.4338155Z 2025-05-07T20:32:01.4338391Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.4338715Z op = silu_mul_quant 2025-05-07T20:32:01.4338975Z if compiled: 2025-05-07T20:32:01.4339229Z op = torch.compile(op) 2025-05-07T20:32:01.4339524Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.4339814Z 2025-05-07T20:32:01.4340014Z y_fp8, y_scale = fn() 2025-05-07T20:32:01.4340297Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:01.4340600Z 2025-05-07T20:32:01.4340986Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.4341324Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:01.4341626Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:01.4341947Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:01.4342313Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.4342627Z 2025-05-07T20:32:01.4342838Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:01.4343039Z 2025-05-07T20:32:01.4343146Z moe/activation_test.py:126: 2025-05-07T20:32:01.4343441Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.4343786Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:01.4344119Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.4344915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:01.4345690Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:01.4346236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.4346919Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.4347608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:01.4348330Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.4349083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:01.4349833Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.4350561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:01.4351199Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:01.4351800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:01.4352322Z fn() 2025-05-07T20:32:01.4352825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:01.4353408Z self.fn.run( 2025-05-07T20:32:01.4353878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.4354499Z kernel = self.compile( 2025-05-07T20:32:01.4355035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.4355695Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.4356097Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.4356330Z 2025-05-07T20:32:01.4356543Z self = 2025-05-07T20:32:01.4357633Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.4359022Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f3a8455d4e0>} 2025-05-07T20:32:01.4360369Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.4361395Z context = 2025-05-07T20:32:01.4361682Z 2025-05-07T20:32:01.4361926Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.4362452Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.4362922Z module_map=module_map) 2025-05-07T20:32:01.4363292Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.4363647Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:01.4363920Z E ^ 2025-05-07T20:32:01.4364389Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.4364842Z 2025-05-07T20:32:01.4365260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.4365780Z 2025-05-07T20:32:01.4365888Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.4366344Z self=, 2025-05-07T20:32:01.4366761Z T=1, 2025-05-07T20:32:01.4366950Z D=5120, 2025-05-07T20:32:01.4367162Z scale_ub=None, 2025-05-07T20:32:01.4367388Z contiguous=True, 2025-05-07T20:32:01.4367669Z compiled=False, 2025-05-07T20:32:01.4367886Z ) 2025-05-07T20:32:01.5549628Z self = 2025-05-07T20:32:01.5551051Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.5551577Z 2025-05-07T20:32:01.5551751Z @given( 2025-05-07T20:32:01.5552238Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.5552859Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.5553474Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.5554123Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.5554782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.5555358Z ) 2025-05-07T20:32:01.5555989Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.5556491Z def test_silu_mul_quant( 2025-05-07T20:32:01.5556738Z self, 2025-05-07T20:32:01.5556939Z T: int, 2025-05-07T20:32:01.5557135Z D: int, 2025-05-07T20:32:01.5557361Z scale_ub: Optional[float], 2025-05-07T20:32:01.5557640Z contiguous: bool, 2025-05-07T20:32:01.5557877Z compiled: bool, 2025-05-07T20:32:01.5558116Z ) -> None: 2025-05-07T20:32:01.5558337Z torch.manual_seed(2025) 2025-05-07T20:32:01.5558576Z 2025-05-07T20:32:01.5559245Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.5559595Z 2025-05-07T20:32:01.5559788Z x_sign = torch.sign(x) 2025-05-07T20:32:01.5560082Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.5560397Z x = x_sign * x_clamp 2025-05-07T20:32:01.5560643Z x0 = x[:, :D] 2025-05-07T20:32:01.5560860Z x1 = x[:, D:] 2025-05-07T20:32:01.5561078Z 2025-05-07T20:32:01.5561273Z if contiguous: 2025-05-07T20:32:01.5561510Z x0 = x0.contiguous() 2025-05-07T20:32:01.5561771Z x1 = x1.contiguous() 2025-05-07T20:32:01.5562018Z 2025-05-07T20:32:01.5562212Z if scale_ub is not None: 2025-05-07T20:32:01.5562488Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.5562828Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.5563139Z ) 2025-05-07T20:32:01.5563343Z else: 2025-05-07T20:32:01.5563564Z scale_ub_tensor = None 2025-05-07T20:32:01.5563823Z 2025-05-07T20:32:01.5564055Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.5564378Z op = silu_mul_quant 2025-05-07T20:32:01.5564625Z if compiled: 2025-05-07T20:32:01.5564877Z 
op = torch.compile(op) 2025-05-07T20:32:01.5565178Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.5565454Z 2025-05-07T20:32:01.5565656Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.5565829Z 2025-05-07T20:32:01.5566077Z moe/activation_test.py:117: 2025-05-07T20:32:01.5566382Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.5566717Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.5567006Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.5567821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.5568515Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.5569049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.5569730Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.5570389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.5570920Z kernel = self.compile( 2025-05-07T20:32:01.5571468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.5572122Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.5572518Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.5572753Z 2025-05-07T20:32:01.5572959Z self = 2025-05-07T20:32:01.5574038Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.5575436Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a8455cea0>} 2025-05-07T20:32:01.5576835Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.5577852Z context = 2025-05-07T20:32:01.5578145Z 2025-05-07T20:32:01.5578311Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.5578836Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.5579395Z module_map=module_map) 2025-05-07T20:32:01.5579757Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.5580116Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.5580385Z E ^ 2025-05-07T20:32:01.5580846Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.5581301Z 2025-05-07T20:32:01.5581721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.5582239Z 2025-05-07T20:32:01.5582346Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.5582760Z self=, 2025-05-07T20:32:01.5583163Z T=128, 2025-05-07T20:32:01.5583364Z D=5120, 2025-05-07T20:32:01.5583566Z scale_ub=None, 2025-05-07T20:32:01.5583787Z contiguous=False, 2025-05-07T20:32:01.5584019Z compiled=True, 2025-05-07T20:32:01.5584241Z ) 2025-05-07T20:32:01.5584559Z self = 2025-05-07T20:32:01.5585053Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:01.5585328Z 2025-05-07T20:32:01.5585414Z @given( 2025-05-07T20:32:01.5585661Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.5585976Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.5586370Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.5586710Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.5587037Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.5587331Z ) 2025-05-07T20:32:01.5587684Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.5588119Z def test_silu_mul_quant( 2025-05-07T20:32:01.5588366Z self, 2025-05-07T20:32:01.5588569Z T: int, 2025-05-07T20:32:01.5588772Z D: int, 2025-05-07T20:32:01.5588998Z scale_ub: Optional[float], 2025-05-07T20:32:01.5589276Z contiguous: bool, 2025-05-07T20:32:01.5589525Z compiled: bool, 2025-05-07T20:32:01.5589751Z ) -> None: 2025-05-07T20:32:01.5589971Z torch.manual_seed(2025) 2025-05-07T20:32:01.5590219Z 2025-05-07T20:32:01.5590489Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.5590834Z 2025-05-07T20:32:01.5591033Z x_sign = torch.sign(x) 2025-05-07T20:32:01.5591325Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.5591641Z x = x_sign * x_clamp 2025-05-07T20:32:01.5591884Z x0 = x[:, :D] 2025-05-07T20:32:01.5592100Z x1 = x[:, D:] 2025-05-07T20:32:01.5592312Z 2025-05-07T20:32:01.5592504Z if contiguous: 2025-05-07T20:32:01.5592733Z x0 = x0.contiguous() 2025-05-07T20:32:01.5592995Z x1 = x1.contiguous() 2025-05-07T20:32:01.5596615Z 2025-05-07T20:32:01.5596812Z if scale_ub is not None: 2025-05-07T20:32:01.5597096Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.5597439Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.5597750Z ) 2025-05-07T20:32:01.5597954Z else: 2025-05-07T20:32:01.5598177Z scale_ub_tensor = None 2025-05-07T20:32:01.5598429Z 2025-05-07T20:32:01.5598663Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.5598984Z op = silu_mul_quant 2025-05-07T20:32:01.5599225Z if compiled: 2025-05-07T20:32:01.5599471Z op = torch.compile(op) 2025-05-07T20:32:01.5599765Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.5600045Z 2025-05-07T20:32:01.5600251Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.5600425Z 2025-05-07T20:32:01.5600525Z moe/activation_test.py:117: 2025-05-07T20:32:01.5600829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.5609677Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.5610005Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.5610578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.5611150Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.5611822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.5612520Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.5613064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.5613751Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.5614417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.5614950Z kernel = self.compile( 2025-05-07T20:32:01.5615506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.5616217Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.5616616Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.5616858Z 2025-05-07T20:32:01.5617068Z self = 2025-05-07T20:32:01.5618317Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.5619699Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9c09c60>} 2025-05-07T20:32:01.5621045Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.5622067Z context = 2025-05-07T20:32:01.5622363Z 2025-05-07T20:32:01.5622531Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.5623055Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.5623534Z module_map=module_map) 2025-05-07T20:32:01.5623900Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.5624260Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.5624527Z E ^ 2025-05-07T20:32:01.5624992Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.5625448Z 2025-05-07T20:32:01.5625981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.5626504Z 2025-05-07T20:32:01.5626611Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.5627028Z self=, 2025-05-07T20:32:01.5627430Z T=128, 2025-05-07T20:32:01.5627628Z D=7168, 2025-05-07T20:32:01.5627829Z scale_ub=1200.0, 2025-05-07T20:32:01.5628053Z contiguous=False, 2025-05-07T20:32:01.5628289Z compiled=False, 2025-05-07T20:32:01.5628502Z ) 2025-05-07T20:32:01.6493639Z self = 2025-05-07T20:32:01.6494433Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.6494764Z 2025-05-07T20:32:01.6494850Z @given( 2025-05-07T20:32:01.6495082Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.6495399Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.6495995Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.6496331Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.6496669Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.6496953Z ) 2025-05-07T20:32:01.6497310Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.6497758Z def test_silu_mul_quant( 2025-05-07T20:32:01.6498004Z self, 2025-05-07T20:32:01.6498209Z T: int, 2025-05-07T20:32:01.6498431Z D: int, 2025-05-07T20:32:01.6498652Z scale_ub: Optional[float], 2025-05-07T20:32:01.6498932Z contiguous: bool, 2025-05-07T20:32:01.6499179Z compiled: bool, 2025-05-07T20:32:01.6499408Z ) -> None: 2025-05-07T20:32:01.6499633Z torch.manual_seed(2025) 2025-05-07T20:32:01.6499901Z 2025-05-07T20:32:01.6500193Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.6500544Z 2025-05-07T20:32:01.6500751Z x_sign = torch.sign(x) 2025-05-07T20:32:01.6501053Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.6501363Z x = x_sign * x_clamp 2025-05-07T20:32:01.6501617Z x0 = x[:, :D] 2025-05-07T20:32:01.6501842Z x1 = x[:, D:] 2025-05-07T20:32:01.6502051Z 2025-05-07T20:32:01.6502248Z if contiguous: 2025-05-07T20:32:01.6502486Z x0 = x0.contiguous() 2025-05-07T20:32:01.6502745Z x1 = x1.contiguous() 2025-05-07T20:32:01.6502993Z 2025-05-07T20:32:01.6503333Z if scale_ub is not None: 2025-05-07T20:32:01.6503612Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.6503956Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.6504272Z ) 2025-05-07T20:32:01.6504466Z else: 2025-05-07T20:32:01.6504688Z scale_ub_tensor = None 2025-05-07T20:32:01.6504951Z 2025-05-07T20:32:01.6505198Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.6505519Z op = silu_mul_quant 2025-05-07T20:32:01.6506080Z if compiled: 2025-05-07T20:32:01.6506569Z op = torch.compile(op) 2025-05-07T20:32:01.6506868Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.6507150Z 2025-05-07T20:32:01.6507353Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.6507519Z 2025-05-07T20:32:01.6507621Z moe/activation_test.py:117: 2025-05-07T20:32:01.6507930Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.6508275Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.6508553Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.6509246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.6509945Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.6510491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.6511302Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.6511970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.6512509Z kernel = self.compile( 2025-05-07T20:32:01.6513060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.6513717Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.6514120Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.6514358Z 2025-05-07T20:32:01.6514565Z self = 2025-05-07T20:32:01.6515652Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.6517110Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9c09800>} 2025-05-07T20:32:01.6518446Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.6519472Z context = 2025-05-07T20:32:01.6519767Z 2025-05-07T20:32:01.6519931Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.6520456Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.6520917Z module_map=module_map) 2025-05-07T20:32:01.6521286Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.6521647Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.6521904Z E ^ 2025-05-07T20:32:01.6522368Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.6522822Z 2025-05-07T20:32:01.6523234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.6523739Z 2025-05-07T20:32:01.6523961Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.6524368Z self=, 2025-05-07T20:32:01.6524773Z T=128, 2025-05-07T20:32:01.6524970Z D=5120, 2025-05-07T20:32:01.6525169Z scale_ub=None, 2025-05-07T20:32:01.6525388Z contiguous=False, 2025-05-07T20:32:01.6525620Z compiled=False, 2025-05-07T20:32:01.6525837Z ) 2025-05-07T20:32:01.6526176Z self = 2025-05-07T20:32:01.6526703Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.6526972Z 2025-05-07T20:32:01.6527060Z @given( 2025-05-07T20:32:01.6527290Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.6527732Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.6528047Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.6528374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.6528718Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.6529011Z ) 2025-05-07T20:32:01.6529363Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.6529800Z def test_silu_mul_quant( 2025-05-07T20:32:01.6530048Z self, 2025-05-07T20:32:01.6530249Z T: int, 2025-05-07T20:32:01.6530448Z D: int, 2025-05-07T20:32:01.6530672Z scale_ub: Optional[float], 2025-05-07T20:32:01.6530947Z contiguous: bool, 2025-05-07T20:32:01.6531249Z compiled: bool, 2025-05-07T20:32:01.6531480Z ) -> None: 2025-05-07T20:32:01.6531697Z torch.manual_seed(2025) 2025-05-07T20:32:01.6531937Z 2025-05-07T20:32:01.6532212Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.6532558Z 2025-05-07T20:32:01.6532751Z x_sign = torch.sign(x) 2025-05-07T20:32:01.6533074Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.6533384Z x = x_sign * x_clamp 2025-05-07T20:32:01.6533632Z x0 = x[:, :D] 2025-05-07T20:32:01.6533849Z x1 = x[:, D:] 2025-05-07T20:32:01.6534060Z 2025-05-07T20:32:01.6534249Z if contiguous: 2025-05-07T20:32:01.6534480Z x0 = x0.contiguous() 2025-05-07T20:32:01.6534748Z x1 = x1.contiguous() 2025-05-07T20:32:01.6534995Z 2025-05-07T20:32:01.6535186Z if scale_ub is not None: 2025-05-07T20:32:01.6535463Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.6535854Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.6536159Z ) 2025-05-07T20:32:01.6536358Z else: 2025-05-07T20:32:01.6536575Z scale_ub_tensor = None 2025-05-07T20:32:01.6536825Z 2025-05-07T20:32:01.6537062Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.6537381Z op = silu_mul_quant 2025-05-07T20:32:01.6537635Z if compiled: 2025-05-07T20:32:01.6537882Z op = torch.compile(op) 2025-05-07T20:32:01.6538185Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.6538467Z 2025-05-07T20:32:01.6538660Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.6538830Z 2025-05-07T20:32:01.6538933Z moe/activation_test.py:117: 2025-05-07T20:32:01.6539232Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.6539564Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.6539851Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.6540545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.6541239Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.6541770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.6542454Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.6543267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.6543799Z kernel = self.compile( 2025-05-07T20:32:01.6544341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.6544995Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.6545392Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.6545624Z 2025-05-07T20:32:01.6545830Z self = 2025-05-07T20:32:01.6546909Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.6548288Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9c0bce0>} 2025-05-07T20:32:01.6549630Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.6550653Z context = 2025-05-07T20:32:01.6550937Z 2025-05-07T20:32:01.6551184Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.6551709Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.6552176Z module_map=module_map) 2025-05-07T20:32:01.6552557Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.6552908Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.6553179Z E ^ 2025-05-07T20:32:01.6553653Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.6554101Z 2025-05-07T20:32:01.6554521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.6555031Z 2025-05-07T20:32:01.6555139Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.6555563Z self=, 2025-05-07T20:32:01.6556060Z T=128, 2025-05-07T20:32:01.6556250Z D=5120, 2025-05-07T20:32:01.6556453Z scale_ub=1200.0, 2025-05-07T20:32:01.6556682Z contiguous=True, 2025-05-07T20:32:01.6556910Z compiled=False, 2025-05-07T20:32:01.6557128Z ) 2025-05-07T20:32:01.7933584Z self = 2025-05-07T20:32:01.7934310Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.7934589Z 2025-05-07T20:32:01.7934671Z @given( 2025-05-07T20:32:01.7934929Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.7935237Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.7935544Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.7935875Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.7936195Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.7936484Z ) 2025-05-07T20:32:01.7936830Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.7937275Z def test_silu_mul_quant( 2025-05-07T20:32:01.7937520Z self, 2025-05-07T20:32:01.7937712Z T: int, 2025-05-07T20:32:01.7937901Z D: int, 2025-05-07T20:32:01.7938118Z scale_ub: Optional[float], 2025-05-07T20:32:01.7938390Z contiguous: bool, 2025-05-07T20:32:01.7938621Z compiled: bool, 2025-05-07T20:32:01.7938852Z ) -> None: 2025-05-07T20:32:01.7939066Z torch.manual_seed(2025) 2025-05-07T20:32:01.7939635Z 2025-05-07T20:32:01.7939904Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.7940247Z 2025-05-07T20:32:01.7940441Z x_sign = torch.sign(x) 2025-05-07T20:32:01.7940723Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.7941029Z x = x_sign * x_clamp 2025-05-07T20:32:01.7941268Z x0 = x[:, :D] 2025-05-07T20:32:01.7941477Z x1 = x[:, D:] 2025-05-07T20:32:01.7941689Z 2025-05-07T20:32:01.7941882Z if contiguous: 2025-05-07T20:32:01.7942105Z x0 = x0.contiguous() 2025-05-07T20:32:01.7942368Z x1 = x1.contiguous() 2025-05-07T20:32:01.7942608Z 2025-05-07T20:32:01.7942794Z if scale_ub is not None: 2025-05-07T20:32:01.7943071Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.7943406Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.7943713Z ) 2025-05-07T20:32:01.7943912Z else: 2025-05-07T20:32:01.7944132Z scale_ub_tensor = None 2025-05-07T20:32:01.7944378Z 2025-05-07T20:32:01.7944610Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.7944927Z op = silu_mul_quant 2025-05-07T20:32:01.7945182Z if compiled: 2025-05-07T20:32:01.7945428Z op = torch.compile(op) 2025-05-07T20:32:01.7945726Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.7946000Z 2025-05-07T20:32:01.7946288Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.7946455Z 2025-05-07T20:32:01.7946556Z moe/activation_test.py:117: 2025-05-07T20:32:01.7946859Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.7947191Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.7947481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.7948169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.7948871Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.7949399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.7950075Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.7950734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.7951263Z kernel = self.compile( 2025-05-07T20:32:01.7951883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.7952537Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.7952934Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.7953161Z 2025-05-07T20:32:01.7953368Z self = 2025-05-07T20:32:01.7954455Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.7955891Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9fc3920>} 2025-05-07T20:32:01.7957242Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.7958274Z context = 2025-05-07T20:32:01.7958560Z 2025-05-07T20:32:01.7958724Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.7959327Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.7959794Z module_map=module_map) 2025-05-07T20:32:01.7960154Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.7960504Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.7960763Z E ^ 2025-05-07T20:32:01.7961224Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.7961670Z 2025-05-07T20:32:01.7962088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.7962597Z 2025-05-07T20:32:01.7962701Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.7963111Z self=, 2025-05-07T20:32:01.7963507Z T=1, 2025-05-07T20:32:01.7963691Z D=7168, 2025-05-07T20:32:01.7963889Z scale_ub=1200.0, 2025-05-07T20:32:01.7964112Z contiguous=True, 2025-05-07T20:32:01.7964334Z compiled=True, 2025-05-07T20:32:01.7964542Z ) 2025-05-07T20:32:01.7964857Z self = 2025-05-07T20:32:01.7965333Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:01.7965596Z 2025-05-07T20:32:01.7965674Z @given( 2025-05-07T20:32:01.7965906Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.7966211Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.7966573Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.7966898Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.7967220Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.7967504Z ) 2025-05-07T20:32:01.7967972Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.7968417Z def test_silu_mul_quant( 2025-05-07T20:32:01.7968661Z self, 2025-05-07T20:32:01.7968865Z T: int, 2025-05-07T20:32:01.7969074Z D: int, 2025-05-07T20:32:01.7969289Z scale_ub: Optional[float], 2025-05-07T20:32:01.7969568Z contiguous: bool, 2025-05-07T20:32:01.7969813Z compiled: bool, 2025-05-07T20:32:01.7970037Z ) -> None: 2025-05-07T20:32:01.7970258Z torch.manual_seed(2025) 2025-05-07T20:32:01.7970508Z 2025-05-07T20:32:01.7970781Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.7971131Z 2025-05-07T20:32:01.7971381Z x_sign = torch.sign(x) 2025-05-07T20:32:01.7971669Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.7971990Z x = x_sign * x_clamp 2025-05-07T20:32:01.7972233Z x0 = x[:, :D] 2025-05-07T20:32:01.7972450Z x1 = x[:, D:] 2025-05-07T20:32:01.7972660Z 2025-05-07T20:32:01.7972851Z if contiguous: 2025-05-07T20:32:01.7973086Z x0 = x0.contiguous() 2025-05-07T20:32:01.7973343Z x1 = x1.contiguous() 2025-05-07T20:32:01.7973596Z 2025-05-07T20:32:01.7973796Z if scale_ub is not None: 2025-05-07T20:32:01.7974066Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.7974405Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.7974717Z ) 2025-05-07T20:32:01.7974907Z else: 2025-05-07T20:32:01.7975126Z scale_ub_tensor = None 2025-05-07T20:32:01.7975380Z 2025-05-07T20:32:01.7975612Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.7975965Z op = silu_mul_quant 2025-05-07T20:32:01.7976251Z if compiled: 2025-05-07T20:32:01.7976497Z op = torch.compile(op) 2025-05-07T20:32:01.7976797Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.7977081Z 2025-05-07T20:32:01.7977275Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.7977447Z 2025-05-07T20:32:01.7977547Z moe/activation_test.py:117: 2025-05-07T20:32:01.7977935Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.7978282Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.7978566Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.7979126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.7979689Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.7980346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.7981048Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.7981595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.7982280Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.7982937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.7983486Z kernel = self.compile( 2025-05-07T20:32:01.7984031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.7984687Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.7985081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.7985318Z 2025-05-07T20:32:01.7985523Z self = 2025-05-07T20:32:01.7986653Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.7988017Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a84950220>} 2025-05-07T20:32:01.7989354Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.7990380Z context = 2025-05-07T20:32:01.7990673Z 2025-05-07T20:32:01.7990837Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.7991359Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.7991867Z module_map=module_map) 2025-05-07T20:32:01.7992234Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.7992590Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.7992846Z E ^ 2025-05-07T20:32:01.7993313Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.7993766Z 2025-05-07T20:32:01.7994186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.7994692Z 2025-05-07T20:32:01.7994802Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.7995209Z self=, 2025-05-07T20:32:01.7995616Z T=1, 2025-05-07T20:32:01.7995807Z D=7168, 2025-05-07T20:32:01.7996001Z scale_ub=1200.0, 2025-05-07T20:32:01.7996240Z contiguous=False, 2025-05-07T20:32:01.7996466Z compiled=True, 2025-05-07T20:32:01.7996670Z ) 2025-05-07T20:32:02.0779136Z self = 2025-05-07T20:32:02.0779668Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:02.0779962Z 2025-05-07T20:32:02.0780051Z @given( 2025-05-07T20:32:02.0792452Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.0792816Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.0793725Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.0794094Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.0794434Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.0794719Z ) 2025-05-07T20:32:02.0795085Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.0795573Z def test_silu_mul_quant( 2025-05-07T20:32:02.0795850Z self, 2025-05-07T20:32:02.0796061Z T: int, 2025-05-07T20:32:02.0796269Z D: int, 2025-05-07T20:32:02.0796489Z scale_ub: Optional[float], 2025-05-07T20:32:02.0796822Z contiguous: bool, 2025-05-07T20:32:02.0797110Z compiled: bool, 2025-05-07T20:32:02.0797345Z ) -> None: 2025-05-07T20:32:02.0797573Z torch.manual_seed(2025) 2025-05-07T20:32:02.0797827Z 2025-05-07T20:32:02.0798102Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.0798458Z 2025-05-07T20:32:02.0798679Z x_sign = torch.sign(x) 2025-05-07T20:32:02.0798970Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.0799293Z x = x_sign * x_clamp 2025-05-07T20:32:02.0799549Z x0 = x[:, :D] 2025-05-07T20:32:02.0799775Z x1 = x[:, D:] 2025-05-07T20:32:02.0799986Z 2025-05-07T20:32:02.0800181Z if contiguous: 2025-05-07T20:32:02.0800427Z x0 = x0.contiguous() 2025-05-07T20:32:02.0800836Z x1 = x1.contiguous() 2025-05-07T20:32:02.0801087Z 2025-05-07T20:32:02.0801289Z if scale_ub is not None: 2025-05-07T20:32:02.0801564Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.0801920Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.0802247Z ) 2025-05-07T20:32:02.0802448Z else: 2025-05-07T20:32:02.0802674Z scale_ub_tensor = None 2025-05-07T20:32:02.0802939Z 2025-05-07T20:32:02.0803179Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.0803510Z op = silu_mul_quant 2025-05-07T20:32:02.0803771Z if compiled: 2025-05-07T20:32:02.0804023Z op = torch.compile(op) 2025-05-07T20:32:02.0804331Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.0804617Z 2025-05-07T20:32:02.0804822Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.0804995Z 2025-05-07T20:32:02.0805101Z moe/activation_test.py:117: 2025-05-07T20:32:02.0805495Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.0806114Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.0806397Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.0806975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:02.0807671Z return fn(*args, **kwargs) 
2025-05-07T20:32:02.0808345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:02.0809048Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:02.0809593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:02.0810283Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:02.0810951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:02.0811502Z     kernel = self.compile(
2025-05-07T20:32:02.0812053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:02.0812758Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:02.0813157Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:02.0813391Z
2025-05-07T20:32:02.0813736Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:02.0814831Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:02.0816346Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f3a849518a0>}
2025-05-07T20:32:02.0817751Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:02.0818778Z context = <triton._C.libtriton.ir.context object at 0x...>
2025-05-07T20:32:02.0819078Z
2025-05-07T20:32:02.0819247Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:02.0819795Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:02.0820279Z             module_map=module_map)
2025-05-07T20:32:02.0820649Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:02.0821017Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:02.0821286Z E       ^
2025-05-07T20:32:02.0821756Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:02.0822294Z
2025-05-07T20:32:02.0822713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:02.0823233Z
2025-05-07T20:32:02.0823343Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:02.0823772Z     self=<...>,
2025-05-07T20:32:02.0824175Z     T=1,
2025-05-07T20:32:02.0824372Z     D=7168,
2025-05-07T20:32:02.0824580Z     scale_ub=None,
2025-05-07T20:32:02.0824808Z     contiguous=False,
2025-05-07T20:32:02.0825049Z     compiled=True,
2025-05-07T20:32:02.0825269Z )
2025-05-07T20:32:02.1506416Z self = <...>
2025-05-07T20:32:02.1506979Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:02.1507255Z
2025-05-07T20:32:02.1507334Z @given(
2025-05-07T20:32:02.1507570Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:02.1508198Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:02.1508503Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:02.1508839Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:02.1509174Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:02.1509459Z )
2025-05-07T20:32:02.1509815Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:02.1510257Z def test_silu_mul_quant(
2025-05-07T20:32:02.1510495Z     self,
2025-05-07T20:32:02.1510703Z     T: int,
2025-05-07T20:32:02.1510903Z     D: int,
2025-05-07T20:32:02.1511115Z     scale_ub: Optional[float],
2025-05-07T20:32:02.1511391Z     contiguous: bool,
2025-05-07T20:32:02.1511630Z     compiled: bool,
2025-05-07T20:32:02.1511858Z ) -> None:
2025-05-07T20:32:02.1512069Z     torch.manual_seed(2025)
2025-05-07T20:32:02.1512315Z
2025-05-07T20:32:02.1512589Z     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:02.1512936Z
2025-05-07T20:32:02.1513131Z     x_sign = torch.sign(x)
2025-05-07T20:32:02.1513425Z     x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:02.1513732Z     x = x_sign * x_clamp
2025-05-07T20:32:02.1513975Z     x0 = x[:, :D]
2025-05-07T20:32:02.1514194Z     x1 = x[:, D:]
2025-05-07T20:32:02.1514398Z
2025-05-07T20:32:02.1514585Z     if contiguous:
2025-05-07T20:32:02.1514816Z         x0 = x0.contiguous()
2025-05-07T20:32:02.1515223Z         x1 = x1.contiguous()
2025-05-07T20:32:02.1515468Z
2025-05-07T20:32:02.1515663Z     if scale_ub is not None:
2025-05-07T20:32:02.1515933Z         scale_ub_tensor = torch.tensor(
2025-05-07T20:32:02.1516271Z             [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:02.1516590Z         )
2025-05-07T20:32:02.1516791Z     else:
2025-05-07T20:32:02.1517006Z         scale_ub_tensor = None
2025-05-07T20:32:02.1517260Z
2025-05-07T20:32:02.1517497Z     def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:02.1517807Z         op = silu_mul_quant
2025-05-07T20:32:02.1518059Z         if compiled:
2025-05-07T20:32:02.1518308Z             op = torch.compile(op)
2025-05-07T20:32:02.1518599Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:02.1518876Z
2025-05-07T20:32:02.1519067Z     y_fp8, y_scale = fn()
2025-05-07T20:32:02.1519345Z     y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:02.1519645Z
2025-05-07T20:32:02.1519881Z     def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:02.1520215Z         x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:02.1520510Z         x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:02.1520822Z         y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:02.1521182Z         return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:02.1521489Z
2025-05-07T20:32:02.1521790Z >   y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:02.1521987Z
2025-05-07T20:32:02.1522095Z moe/activation_test.py:126:
2025-05-07T20:32:02.1522387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:02.1522724Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:02.1523049Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:02.1523838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:02.1524595Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:02.1525142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:02.1525823Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:02.1526502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:02.1527275Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:02.1528114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:02.1528867Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:02.1529599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:02.1530243Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:02.1530848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:02.1531377Z     fn()
2025-05-07T20:32:02.1531886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:02.1532470Z     self.fn.run(
2025-05-07T20:32:02.1532947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:02.1533477Z     kernel = self.compile(
2025-05-07T20:32:02.1534020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:02.1534680Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:02.1535087Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:02.1535400Z
2025-05-07T20:32:02.1535611Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:02.1536705Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:02.1538103Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f3a84952e80>}
2025-05-07T20:32:02.1539451Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:02.1540471Z context = <triton._C.libtriton.ir.context object at 0x...>
2025-05-07T20:32:02.1540766Z
2025-05-07T20:32:02.1540940Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:02.1541464Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:02.1541933Z             module_map=module_map)
2025-05-07T20:32:02.1542296Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:02.1542658Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:02.1542947Z E       ^
2025-05-07T20:32:02.1543477Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:02.1543929Z
2025-05-07T20:32:02.1544351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:02.1544868Z
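The failures above and below share one root cause: Triton's fp8e4nv is the e4m3 float8 format, which NVIDIA GPUs support natively only from compute capability 8.9 (Ada/Hopper) onward. This job runs on linux.g5.4xlarge, whose A10G GPU is sm_86, so Triton exposes only fp8e5 and fp8e4b15 there and rejects any kernel that touches fp8e4nv while lowering it to TTIR. A minimal sketch of a capability guard that would let the suite skip cleanly on such runners (the helper name and skip message are illustrative, not FBGEMM's actual API):

import torch

def _supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (float8 e4m3) needs an NVIDIA GPU with compute
    # capability >= 8.9; the A10G on linux.g5.4xlarge reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical usage on the test above:
# @unittest.skipUnless(_supports_fp8e4nv(), "FP8 e4m3 kernels need SM 8.9+")
# def test_silu_mul_quant(self, ...) -> None: ...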
2025-05-07T20:32:02.2790000Z op = torch.compile(op) 2025-05-07T20:32:02.2790428Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.2790777Z 2025-05-07T20:32:02.2791131Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.2791397Z 2025-05-07T20:32:02.2791541Z moe/activation_test.py:117: 2025-05-07T20:32:02.2791939Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.2792406Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.2792819Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.2793475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:02.2794334Z return fn(*args, **kwargs) 2025-05-07T20:32:02.2795067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.2795855Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.2796582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.2797337Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.2798097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.2798822Z kernel = self.compile( 2025-05-07T20:32:02.2799474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.2800189Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.2800787Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.2801144Z 2025-05-07T20:32:02.2801392Z self = 2025-05-07T20:32:02.2802641Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.2804121Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a84953a60>} 2025-05-07T20:32:02.2805578Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.2807865Z context = 2025-05-07T20:32:02.2808215Z 2025-05-07T20:32:02.2808450Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.2809053Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.2809643Z module_map=module_map) 2025-05-07T20:32:02.2810107Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.2810566Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.2810911Z E ^ 2025-05-07T20:32:02.2811640Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.2812177Z 2025-05-07T20:32:02.2812675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.2813198Z 2025-05-07T20:32:02.2813431Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.2813893Z self=, 2025-05-07T20:32:02.2814406Z T=1, 2025-05-07T20:32:02.2814723Z D=5120, 2025-05-07T20:32:02.2815026Z scale_ub=1200.0, 2025-05-07T20:32:02.2815302Z contiguous=False, 2025-05-07T20:32:02.2815655Z compiled=False, 2025-05-07T20:32:02.2815970Z ) 2025-05-07T20:32:02.2816335Z self = 2025-05-07T20:32:02.2816999Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:02.2817292Z 2025-05-07T20:32:02.2817463Z @given( 2025-05-07T20:32:02.2817787Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.2818247Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.2818654Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.2819053Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.2819490Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.2819878Z ) 2025-05-07T20:32:02.2820373Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.2820947Z def test_silu_mul_quant( 2025-05-07T20:32:02.2821272Z self, 2025-05-07T20:32:02.2821539Z T: int, 2025-05-07T20:32:02.2821868Z D: int, 2025-05-07T20:32:02.2822171Z scale_ub: Optional[float], 2025-05-07T20:32:02.2822557Z contiguous: bool, 2025-05-07T20:32:02.2822960Z compiled: bool, 2025-05-07T20:32:02.2823236Z ) -> None: 2025-05-07T20:32:02.2823523Z torch.manual_seed(2025) 2025-05-07T20:32:02.2823927Z 2025-05-07T20:32:02.2824250Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.2824684Z 2025-05-07T20:32:02.2825025Z x_sign = torch.sign(x) 2025-05-07T20:32:02.2825368Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.2825770Z x = x_sign * x_clamp 2025-05-07T20:32:02.2826150Z x0 = x[:, :D] 2025-05-07T20:32:02.2826463Z x1 = x[:, D:] 2025-05-07T20:32:02.2826847Z 2025-05-07T20:32:02.2827178Z if contiguous: 2025-05-07T20:32:02.2827515Z x0 = x0.contiguous() 2025-05-07T20:32:02.2827811Z x1 = x1.contiguous() 2025-05-07T20:32:02.2828199Z 2025-05-07T20:32:02.2828498Z if scale_ub is not None: 2025-05-07T20:32:02.2828806Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.2829289Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.2829702Z ) 2025-05-07T20:32:02.2829940Z else: 2025-05-07T20:32:02.2830300Z scale_ub_tensor = None 2025-05-07T20:32:02.2830699Z 2025-05-07T20:32:02.2830972Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.2831438Z op = silu_mul_quant 2025-05-07T20:32:02.2831792Z if compiled: 2025-05-07T20:32:02.2832156Z op = torch.compile(op) 2025-05-07T20:32:02.2832538Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.2832901Z 2025-05-07T20:32:02.2833215Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.2833421Z 2025-05-07T20:32:02.2833569Z moe/activation_test.py:117: 2025-05-07T20:32:02.2833948Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.2834410Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.2834850Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.2835700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.2836589Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.2837221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.2837982Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.2838744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.2839381Z kernel = self.compile( 2025-05-07T20:32:02.2840003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.2840805Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.2841251Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.2841539Z 2025-05-07T20:32:02.2841792Z self = 2025-05-07T20:32:02.2843042Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.2844494Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a8420c9a0>} 2025-05-07T20:32:02.2845962Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.2847163Z context = 2025-05-07T20:32:02.2847508Z 2025-05-07T20:32:02.2847783Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.2848404Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.2849018Z module_map=module_map) 2025-05-07T20:32:02.2849456Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.2849877Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.2850281Z E ^ 2025-05-07T20:32:02.2850819Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.2851377Z 2025-05-07T20:32:02.2851805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.2852472Z 2025-05-07T20:32:02.2852605Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.2853122Z self=, 2025-05-07T20:32:02.2853561Z T=16384, 2025-05-07T20:32:02.2853923Z D=5120, 2025-05-07T20:32:02.2854201Z scale_ub=1200.0, 2025-05-07T20:32:02.2854468Z contiguous=False, 2025-05-07T20:32:02.2854861Z compiled=True, 2025-05-07T20:32:02.2855151Z ) 2025-05-07T20:32:02.3530291Z self = 2025-05-07T20:32:02.3531098Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:02.3531495Z 2025-05-07T20:32:02.3531620Z @given( 2025-05-07T20:32:02.3531941Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.3532350Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.3544070Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.3544488Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.3544844Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.3545155Z ) 2025-05-07T20:32:02.3545523Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.3545978Z def test_silu_mul_quant( 2025-05-07T20:32:02.3546241Z self, 2025-05-07T20:32:02.3546733Z T: int, 2025-05-07T20:32:02.3546942Z D: int, 2025-05-07T20:32:02.3547175Z scale_ub: Optional[float], 2025-05-07T20:32:02.3547460Z contiguous: bool, 2025-05-07T20:32:02.3547704Z compiled: bool, 2025-05-07T20:32:02.3547938Z ) -> None: 2025-05-07T20:32:02.3548160Z torch.manual_seed(2025) 2025-05-07T20:32:02.3548403Z 2025-05-07T20:32:02.3548689Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.3549044Z 2025-05-07T20:32:02.3549237Z x_sign = torch.sign(x) 2025-05-07T20:32:02.3549538Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.3549856Z x = x_sign * x_clamp 2025-05-07T20:32:02.3550098Z x0 = x[:, :D] 2025-05-07T20:32:02.3550325Z x1 = x[:, D:] 2025-05-07T20:32:02.3550545Z 2025-05-07T20:32:02.3550743Z if contiguous: 2025-05-07T20:32:02.3550977Z x0 = x0.contiguous() 2025-05-07T20:32:02.3551251Z x1 = x1.contiguous() 2025-05-07T20:32:02.3551499Z 2025-05-07T20:32:02.3551694Z if scale_ub is not None: 2025-05-07T20:32:02.3551974Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.3552318Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.3552624Z ) 2025-05-07T20:32:02.3552828Z else: 2025-05-07T20:32:02.3553050Z scale_ub_tensor = None 2025-05-07T20:32:02.3553297Z 2025-05-07T20:32:02.3553641Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.3553963Z op = silu_mul_quant 2025-05-07T20:32:02.3554215Z if compiled: 2025-05-07T20:32:02.3554471Z op = torch.compile(op) 2025-05-07T20:32:02.3554778Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.3555054Z 2025-05-07T20:32:02.3555260Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.3555434Z 2025-05-07T20:32:02.3555536Z moe/activation_test.py:117: 2025-05-07T20:32:02.3555882Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.3556239Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.3556535Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.3557103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:02.3557663Z return fn(*args, **kwargs) 
2025-05-07T20:32:02.3558333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.3559115Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.3559660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.3560343Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.3561018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.3561555Z kernel = self.compile( 2025-05-07T20:32:02.3562094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.3562757Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.3563160Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.3563390Z 2025-05-07T20:32:02.3563610Z self = 2025-05-07T20:32:02.3564688Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.3566085Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a8420de40>} 2025-05-07T20:32:02.3567605Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.3568643Z context = 2025-05-07T20:32:02.3568933Z 2025-05-07T20:32:02.3569112Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.3569634Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.3570115Z module_map=module_map) 2025-05-07T20:32:02.3570493Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.3570853Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.3571133Z E ^ 2025-05-07T20:32:02.3571606Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.3572060Z 2025-05-07T20:32:02.3572477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.3572985Z 2025-05-07T20:32:02.3573092Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.3573523Z self=, 2025-05-07T20:32:02.3573944Z T=2048, 2025-05-07T20:32:02.3574150Z D=7168, 2025-05-07T20:32:02.3574431Z scale_ub=1200.0, 2025-05-07T20:32:02.3574680Z contiguous=False, 2025-05-07T20:32:02.3574914Z compiled=True, 2025-05-07T20:32:02.3575141Z ) 2025-05-07T20:32:02.3575475Z self = 2025-05-07T20:32:02.3575974Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:02.3576259Z 2025-05-07T20:32:02.3576343Z @given( 2025-05-07T20:32:02.3576578Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.3576903Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.3577208Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.3577544Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.3577876Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.3578160Z ) 2025-05-07T20:32:02.3578511Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.3578956Z def test_silu_mul_quant( 2025-05-07T20:32:02.3579249Z self, 2025-05-07T20:32:02.3579449Z T: int, 2025-05-07T20:32:02.3579655Z D: int, 2025-05-07T20:32:02.3579870Z scale_ub: Optional[float], 2025-05-07T20:32:02.3580147Z contiguous: bool, 2025-05-07T20:32:02.3580394Z compiled: bool, 2025-05-07T20:32:02.3580624Z ) -> None: 2025-05-07T20:32:02.3580839Z torch.manual_seed(2025) 2025-05-07T20:32:02.3581086Z 2025-05-07T20:32:02.3581369Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.3581712Z 2025-05-07T20:32:02.3581912Z x_sign = torch.sign(x) 2025-05-07T20:32:02.3582212Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.3582521Z x = x_sign * x_clamp 2025-05-07T20:32:02.3582767Z x0 = x[:, :D] 2025-05-07T20:32:02.3582991Z x1 = x[:, D:] 2025-05-07T20:32:02.3583199Z 2025-05-07T20:32:02.3583394Z if contiguous: 2025-05-07T20:32:02.3583633Z x0 = x0.contiguous() 2025-05-07T20:32:02.3583894Z x1 = x1.contiguous() 2025-05-07T20:32:02.3584141Z 2025-05-07T20:32:02.3584343Z if scale_ub is not None: 2025-05-07T20:32:02.3584617Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.3584956Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.3585272Z ) 2025-05-07T20:32:02.3585476Z else: 2025-05-07T20:32:02.3585690Z scale_ub_tensor = None 2025-05-07T20:32:02.3585949Z 2025-05-07T20:32:02.3586277Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.3586594Z op = silu_mul_quant 2025-05-07T20:32:02.3586851Z if compiled: 2025-05-07T20:32:02.3587106Z op = torch.compile(op) 2025-05-07T20:32:02.3587399Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.3587680Z 2025-05-07T20:32:02.3587879Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.3588043Z 2025-05-07T20:32:02.3588150Z moe/activation_test.py:117: 2025-05-07T20:32:02.3588450Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.3588809Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.3589095Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.3589648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:02.3590209Z return fn(*args, **kwargs) 
2025-05-07T20:32:02.3590876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.3591561Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.3592099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.3592791Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.3593453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.3594065Z kernel = self.compile( 2025-05-07T20:32:02.3594609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.3595265Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.3595669Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.3595898Z 2025-05-07T20:32:02.3596111Z self = 2025-05-07T20:32:02.3597191Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.3598560Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a8420e980>} 2025-05-07T20:32:02.3599948Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.3600969Z context = 2025-05-07T20:32:02.3601262Z 2025-05-07T20:32:02.3601428Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.3601961Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.3602429Z module_map=module_map) 2025-05-07T20:32:02.3602790Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.3603149Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.3603419Z E ^ 2025-05-07T20:32:02.3603881Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.3604342Z 2025-05-07T20:32:02.3604757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.3605273Z 2025-05-07T20:32:02.4506116Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.4507415Z self=, 2025-05-07T20:32:02.4508489Z T=1, 2025-05-07T20:32:02.4508887Z D=5120, 2025-05-07T20:32:02.4509791Z scale_ub=None, 2025-05-07T20:32:02.4510225Z contiguous=False, 2025-05-07T20:32:02.4510674Z compiled=False, 2025-05-07T20:32:02.4511076Z ) 2025-05-07T20:32:02.4511698Z self = 2025-05-07T20:32:02.4512669Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:02.4513203Z 2025-05-07T20:32:02.4513355Z @given( 2025-05-07T20:32:02.4513812Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.4514445Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.4515054Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.4515709Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.4516298Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.4516642Z ) 2025-05-07T20:32:02.4517001Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.4517445Z def test_silu_mul_quant( 2025-05-07T20:32:02.4517692Z self, 2025-05-07T20:32:02.4517889Z T: int, 2025-05-07T20:32:02.4518082Z D: int, 2025-05-07T20:32:02.4518301Z scale_ub: Optional[float], 2025-05-07T20:32:02.4518572Z contiguous: bool, 2025-05-07T20:32:02.4518808Z compiled: bool, 2025-05-07T20:32:02.4519037Z ) -> None: 2025-05-07T20:32:02.4519252Z torch.manual_seed(2025) 2025-05-07T20:32:02.4519496Z 2025-05-07T20:32:02.4519878Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.4520225Z 2025-05-07T20:32:02.4520422Z x_sign = torch.sign(x) 2025-05-07T20:32:02.4520706Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.4521015Z x = x_sign * x_clamp 2025-05-07T20:32:02.4521257Z x0 = x[:, :D] 2025-05-07T20:32:02.4521468Z x1 = x[:, D:] 2025-05-07T20:32:02.4521678Z 2025-05-07T20:32:02.4521865Z if contiguous: 2025-05-07T20:32:02.4522100Z x0 = x0.contiguous() 2025-05-07T20:32:02.4522364Z x1 = x1.contiguous() 2025-05-07T20:32:02.4522609Z 2025-05-07T20:32:02.4522796Z if scale_ub is not None: 2025-05-07T20:32:02.4523068Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.4523407Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.4523711Z ) 2025-05-07T20:32:02.4523909Z else: 2025-05-07T20:32:02.4524126Z scale_ub_tensor = None 2025-05-07T20:32:02.4524470Z 2025-05-07T20:32:02.4524696Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.4525012Z op = silu_mul_quant 2025-05-07T20:32:02.4525261Z if compiled: 2025-05-07T20:32:02.4525506Z op = torch.compile(op) 2025-05-07T20:32:02.4525807Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.4526086Z 2025-05-07T20:32:02.4526275Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.4526446Z 2025-05-07T20:32:02.4526553Z moe/activation_test.py:117: 2025-05-07T20:32:02.4526853Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.4527182Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.4527463Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.4528268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.4528970Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.4529512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.4530201Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.4530869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.4531403Z kernel = self.compile( 2025-05-07T20:32:02.4532033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.4532693Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.4533093Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.4533324Z 2025-05-07T20:32:02.4533531Z self = 2025-05-07T20:32:02.4534614Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.4536042Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9740220>} 2025-05-07T20:32:02.4537417Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.4538438Z context = 2025-05-07T20:32:02.4538725Z 2025-05-07T20:32:02.4538891Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.4539417Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.4539935Z module_map=module_map) 2025-05-07T20:32:02.4540297Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.4540654Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.4540918Z E ^ 2025-05-07T20:32:02.4541386Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.4541834Z 2025-05-07T20:32:02.4542255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.4542771Z 2025-05-07T20:32:02.4542877Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.4543293Z self=, 2025-05-07T20:32:02.4543704Z T=4096, 2025-05-07T20:32:02.4543895Z D=7168, 2025-05-07T20:32:02.4544095Z scale_ub=1200.0, 2025-05-07T20:32:02.4544328Z contiguous=False, 2025-05-07T20:32:02.4544552Z compiled=False, 2025-05-07T20:32:02.4544810Z ) 2025-05-07T20:32:02.4545132Z self = 2025-05-07T20:32:02.4545624Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:02.4545907Z 2025-05-07T20:32:02.4545990Z @given( 2025-05-07T20:32:02.4546226Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.4546536Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.4546851Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.4547191Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.4547524Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.4547811Z ) 2025-05-07T20:32:02.4548164Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.4548611Z def test_silu_mul_quant( 2025-05-07T20:32:02.4548859Z self, 2025-05-07T20:32:02.4549062Z T: int, 2025-05-07T20:32:02.4549271Z D: int, 2025-05-07T20:32:02.4549493Z scale_ub: Optional[float], 2025-05-07T20:32:02.4549775Z contiguous: bool, 2025-05-07T20:32:02.4550024Z compiled: bool, 2025-05-07T20:32:02.4550247Z ) -> None: 2025-05-07T20:32:02.4550470Z torch.manual_seed(2025) 2025-05-07T20:32:02.4550722Z 2025-05-07T20:32:02.4550996Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.4551347Z 2025-05-07T20:32:02.4551551Z x_sign = torch.sign(x) 2025-05-07T20:32:02.4551925Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.4552246Z x = x_sign * x_clamp 2025-05-07T20:32:02.4552496Z x0 = x[:, :D] 2025-05-07T20:32:02.4552716Z x1 = x[:, D:] 2025-05-07T20:32:02.4552933Z 2025-05-07T20:32:02.4553127Z if contiguous: 2025-05-07T20:32:02.4553363Z x0 = x0.contiguous() 2025-05-07T20:32:02.4553621Z x1 = x1.contiguous() 2025-05-07T20:32:02.4553866Z 2025-05-07T20:32:02.4554068Z if scale_ub is not None: 2025-05-07T20:32:02.4554341Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.4554682Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.4554998Z ) 2025-05-07T20:32:02.4555195Z else: 2025-05-07T20:32:02.4555410Z scale_ub_tensor = None 2025-05-07T20:32:02.4555663Z 2025-05-07T20:32:02.4555898Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.4556215Z op = silu_mul_quant 2025-05-07T20:32:02.4556477Z if compiled: 2025-05-07T20:32:02.4556725Z op = torch.compile(op) 2025-05-07T20:32:02.4557025Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.4557309Z 2025-05-07T20:32:02.4557502Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.4557678Z 2025-05-07T20:32:02.4557780Z moe/activation_test.py:117: 2025-05-07T20:32:02.4558082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.4558524Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.4558805Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.4559499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:02.4560197Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.4560734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.4561431Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.4562094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.4562636Z kernel = self.compile( 2025-05-07T20:32:02.4563176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.4563840Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.4564291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.4564521Z 2025-05-07T20:32:02.4564735Z self = 2025-05-07T20:32:02.4565810Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.4567237Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9741440>} 2025-05-07T20:32:02.4568652Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.4569683Z context = 2025-05-07T20:32:02.4569971Z 2025-05-07T20:32:02.4570137Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.4570662Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.4571131Z module_map=module_map) 2025-05-07T20:32:02.4571501Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.4571935Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.4572207Z E ^ 2025-05-07T20:32:02.4572680Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.4573130Z 2025-05-07T20:32:02.4573549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.4574060Z 2025-05-07T20:32:02.4574166Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.4574595Z self=, 2025-05-07T20:32:02.4575007Z T=16384, 2025-05-07T20:32:02.4575205Z D=7168, 2025-05-07T20:32:02.4575405Z scale_ub=None, 2025-05-07T20:32:02.4575622Z contiguous=True, 2025-05-07T20:32:02.4575848Z compiled=True, 2025-05-07T20:32:02.4576248Z ) 2025-05-07T20:32:02.7636120Z self = 2025-05-07T20:32:02.7637124Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:02.7637445Z 2025-05-07T20:32:02.7637540Z @given( 2025-05-07T20:32:02.7637773Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.7638096Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.7638410Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.7638750Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.7639080Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.7639686Z ) 2025-05-07T20:32:02.7640046Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.7640491Z def test_silu_mul_quant( 2025-05-07T20:32:02.7640744Z self, 2025-05-07T20:32:02.7640943Z T: int, 2025-05-07T20:32:02.7641141Z D: int, 2025-05-07T20:32:02.7641367Z scale_ub: Optional[float], 2025-05-07T20:32:02.7641644Z contiguous: bool, 2025-05-07T20:32:02.7641885Z compiled: bool, 2025-05-07T20:32:02.7642126Z ) -> None: 2025-05-07T20:32:02.7642348Z torch.manual_seed(2025) 2025-05-07T20:32:02.7642595Z 2025-05-07T20:32:02.7642876Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.7643227Z 2025-05-07T20:32:02.7643450Z x_sign = torch.sign(x) 2025-05-07T20:32:02.7643747Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.7644062Z x = x_sign * x_clamp 2025-05-07T20:32:02.7644405Z x0 = x[:, :D] 2025-05-07T20:32:02.7644629Z x1 = x[:, D:] 2025-05-07T20:32:02.7644845Z 2025-05-07T20:32:02.7645040Z if contiguous: 2025-05-07T20:32:02.7645273Z x0 = x0.contiguous() 2025-05-07T20:32:02.7645539Z x1 = x1.contiguous() 2025-05-07T20:32:02.7645786Z 2025-05-07T20:32:02.7645981Z if scale_ub is not None: 2025-05-07T20:32:02.7646260Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.7646609Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.7646970Z ) 2025-05-07T20:32:02.7647178Z else: 2025-05-07T20:32:02.7647398Z scale_ub_tensor = None 2025-05-07T20:32:02.7647773Z 2025-05-07T20:32:02.7648012Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.7648418Z op = silu_mul_quant 2025-05-07T20:32:02.7648838Z if compiled: 2025-05-07T20:32:02.7649089Z op = torch.compile(op) 2025-05-07T20:32:02.7649434Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.7660111Z 2025-05-07T20:32:02.7660352Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.7660534Z 2025-05-07T20:32:02.7660653Z moe/activation_test.py:117: 2025-05-07T20:32:02.7660971Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.7661320Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.7661621Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.7662415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:02.7662999Z return fn(*args, **kwargs) 
2025-05-07T20:32:02.7663687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.7664407Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.7664968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.7665669Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.7666351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.7666903Z kernel = self.compile( 2025-05-07T20:32:02.7667461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.7668149Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.7668565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.7668800Z 2025-05-07T20:32:02.7669020Z self = 2025-05-07T20:32:02.7670129Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.7671609Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9742520>} 2025-05-07T20:32:02.7672983Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.7674038Z context = 2025-05-07T20:32:02.7674332Z 2025-05-07T20:32:02.7674511Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.7675044Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.7675528Z module_map=module_map) 2025-05-07T20:32:02.7675911Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.7676328Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.7676605Z E ^ 2025-05-07T20:32:02.7677087Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.7677543Z 2025-05-07T20:32:02.7677972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.7678484Z 2025-05-07T20:32:02.7678600Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.7679024Z self=, 2025-05-07T20:32:02.7679438Z T=4096, 2025-05-07T20:32:02.7679638Z D=5120, 2025-05-07T20:32:02.7679844Z scale_ub=None, 2025-05-07T20:32:02.7680079Z contiguous=False, 2025-05-07T20:32:02.7680310Z compiled=True, 2025-05-07T20:32:02.7680533Z ) 2025-05-07T20:32:02.7680863Z self = 2025-05-07T20:32:02.7681376Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:02.7681652Z 2025-05-07T20:32:02.7681737Z @given( 2025-05-07T20:32:02.7681989Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.7682317Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.7682630Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.7682977Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.7683404Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.7683699Z ) 2025-05-07T20:32:02.7684061Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.7684516Z def test_silu_mul_quant( 2025-05-07T20:32:02.7684768Z self, 2025-05-07T20:32:02.7684986Z T: int, 2025-05-07T20:32:02.7685201Z D: int, 2025-05-07T20:32:02.7685436Z scale_ub: Optional[float], 2025-05-07T20:32:02.7685718Z contiguous: bool, 2025-05-07T20:32:02.7685978Z compiled: bool, 2025-05-07T20:32:02.7686218Z ) -> None: 2025-05-07T20:32:02.7686446Z torch.manual_seed(2025) 2025-05-07T20:32:02.7686745Z 2025-05-07T20:32:02.7687039Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.7687388Z 2025-05-07T20:32:02.7687672Z x_sign = torch.sign(x) 2025-05-07T20:32:02.7687979Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.7688298Z x = x_sign * x_clamp 2025-05-07T20:32:02.7688561Z x0 = x[:, :D] 2025-05-07T20:32:02.7688796Z x1 = x[:, D:] 2025-05-07T20:32:02.7689010Z 2025-05-07T20:32:02.7689215Z if contiguous: 2025-05-07T20:32:02.7689463Z x0 = x0.contiguous() 2025-05-07T20:32:02.7689728Z x1 = x1.contiguous() 2025-05-07T20:32:02.7689984Z 2025-05-07T20:32:02.7690195Z if scale_ub is not None: 2025-05-07T20:32:02.7690478Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.7690878Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.7691200Z ) 2025-05-07T20:32:02.7691409Z else: 2025-05-07T20:32:02.7691632Z scale_ub_tensor = None 2025-05-07T20:32:02.7691897Z 2025-05-07T20:32:02.7692139Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.7692461Z op = silu_mul_quant 2025-05-07T20:32:02.7692720Z if compiled: 2025-05-07T20:32:02.7692980Z op = torch.compile(op) 2025-05-07T20:32:02.7693285Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.7693572Z 2025-05-07T20:32:02.7693776Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.7693944Z 2025-05-07T20:32:02.7694054Z moe/activation_test.py:117: 2025-05-07T20:32:02.7694352Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.7694697Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.7694992Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.7695603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:02.7696172Z return fn(*args, **kwargs) 
2025-05-07T20:32:02.7696888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.7697582Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.7698128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.7698819Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.7699489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.7700027Z kernel = self.compile( 2025-05-07T20:32:02.7700577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.7701245Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.7701652Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.7701885Z 2025-05-07T20:32:02.7702093Z self = 2025-05-07T20:32:02.7703256Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.7704639Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9742c00>} 2025-05-07T20:32:02.7706341Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.7707376Z context = 2025-05-07T20:32:02.7707667Z 2025-05-07T20:32:02.7707837Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.7708369Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.7708844Z module_map=module_map) 2025-05-07T20:32:02.7709217Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.7709586Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.7709857Z E ^ 2025-05-07T20:32:02.7710333Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.7710784Z 2025-05-07T20:32:02.7711201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.7711801Z 2025-05-07T20:32:02.8850581Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.8851248Z self=, 2025-05-07T20:32:02.8851810Z T=4096, 2025-05-07T20:32:02.8852071Z D=5120, 2025-05-07T20:32:02.8852344Z scale_ub=1200.0, 2025-05-07T20:32:02.8852605Z contiguous=False, 2025-05-07T20:32:02.8852830Z compiled=False, 2025-05-07T20:32:02.8853048Z ) 2025-05-07T20:32:02.8853370Z self = 2025-05-07T20:32:02.8853891Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:02.8854179Z 2025-05-07T20:32:02.8854262Z @given( 2025-05-07T20:32:02.8854497Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.8854815Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.8855122Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.8855456Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.8856143Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.8856435Z ) 2025-05-07T20:32:02.8856786Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.8857234Z def test_silu_mul_quant( 2025-05-07T20:32:02.8857478Z self, 2025-05-07T20:32:02.8857686Z T: int, 2025-05-07T20:32:02.8857891Z D: int, 2025-05-07T20:32:02.8858109Z scale_ub: Optional[float], 2025-05-07T20:32:02.8858392Z contiguous: bool, 2025-05-07T20:32:02.8858639Z compiled: bool, 2025-05-07T20:32:02.8858869Z ) -> None: 2025-05-07T20:32:02.8859083Z torch.manual_seed(2025) 2025-05-07T20:32:02.8859332Z 2025-05-07T20:32:02.8859613Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.8859960Z 2025-05-07T20:32:02.8860159Z x_sign = torch.sign(x) 2025-05-07T20:32:02.8860455Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.8860772Z x = x_sign * x_clamp 2025-05-07T20:32:02.8861018Z x0 = x[:, :D] 2025-05-07T20:32:02.8861245Z x1 = x[:, D:] 2025-05-07T20:32:02.8861458Z 2025-05-07T20:32:02.8861652Z if contiguous: 2025-05-07T20:32:02.8861898Z x0 = x0.contiguous() 2025-05-07T20:32:02.8862154Z x1 = x1.contiguous() 2025-05-07T20:32:02.8862401Z 2025-05-07T20:32:02.8862603Z if scale_ub is not None: 2025-05-07T20:32:02.8862876Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.8863361Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.8863680Z ) 2025-05-07T20:32:02.8863888Z else: 2025-05-07T20:32:02.8864103Z scale_ub_tensor = None 2025-05-07T20:32:02.8864361Z 2025-05-07T20:32:02.8864605Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.8864921Z op = silu_mul_quant 2025-05-07T20:32:02.8865206Z if compiled: 2025-05-07T20:32:02.8865474Z op = torch.compile(op) 2025-05-07T20:32:02.8865786Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.8866062Z 2025-05-07T20:32:02.8866262Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.8866426Z 2025-05-07T20:32:02.8866537Z moe/activation_test.py:117: 2025-05-07T20:32:02.8866858Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.8867192Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.8867489Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.8868183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:02.8868875Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.8869417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.8870105Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.8870877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.8871407Z kernel = self.compile( 2025-05-07T20:32:02.8871956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.8872616Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.8873018Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.8873254Z 2025-05-07T20:32:02.8873461Z self = 2025-05-07T20:32:02.8874543Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.8875936Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a92d8400>} 2025-05-07T20:32:02.8877330Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.8878351Z context = 2025-05-07T20:32:02.8878645Z 2025-05-07T20:32:02.8878817Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.8879341Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.8879811Z module_map=module_map) 2025-05-07T20:32:02.8880174Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.8880627Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.8880933Z E ^ 2025-05-07T20:32:02.8881401Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:02.8882902Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self = <...>, T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
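Every draw fails at the same point: the Triton frontend rejects the fp8e4nv (FP8 e4m3) type while lowering _fbgemm_silu_mul_quant, before any kernel runs. Triton's CUDA backend emits fp8e4nv only on compute capability 8.9 and newer (Ada/Hopper); on older parts such as the A10G (SM 8.6) found in AWS g5 runners it exposes only fp8e4b15 and fp8e5, which is exactly what the ValueError lists. A minimal sketch of a capability gate that would skip these cases instead of failing them; the helper name and its placement are illustrative, not FBGEMM API:

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv corresponds to torch.float8_e4m3fn; Triton emits it natively
        # only on NVIDIA compute capability >= (8, 9), i.e. Ada and Hopper.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Usage sketch on the test above (unittest-style, matching the test class):
    #
    #     @unittest.skipUnless(
    #         supports_fp8e4nv(), "fp8e4nv needs SM 8.9+; this GPU is older"
    #     )
    #     def test_silu_mul_quant(self, ...) -> None: ...

With a gate like this, runs on SM 8.6 hardware would record skips rather than repeating the identical CompilationError for every Hypothesis example.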
The remaining Hypothesis draws hit the identical traceback and CompilationError; only the drawn parameters differ:

2025-05-07T20:32:02.9799600Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -- same CompilationError
2025-05-07T20:32:02.9832218Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -- same CompilationError
2025-05-07T20:32:02.9873833Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -- same CompilationError
2025-05-07T20:32:03.3529357Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) -- same CompilationError
2025-05-07T20:32:03.3560662Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -- same CompilationError
2025-05-07T20:32:03.4286368Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) -- same CompilationError
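The full test body and traceback are re-printed for each draw because the test runs with verbosity=Verbosity.verbose in its @settings; at that level Hypothesis echoes every "Trying example". A sketch of a quieter configuration for CI, assuming nothing about the module beyond the hypothesis import; a failure still ends with the minimal failing example:

    from hypothesis import Verbosity, settings

    # Register and load a CI profile once (e.g. in conftest.py). Verbosity.normal
    # drops the per-example "Trying example: ..." echo seen throughout this log.
    settings.register_profile("ci", verbosity=Verbosity.normal, deadline=None)
    settings.load_profile("ci")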
2025-05-07T20:32:03.5586454Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -- same CompilationError
2025-05-07T20:32:03.5624067Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -- same CompilationError
2025-05-07T20:32:03.6971485Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True) -- same CompilationError
2025-05-07T20:32:03.7004693Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True) -- same CompilationError
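For what the failing op computes: judging only from the test body above (bf16 inputs x0 and x1, an optional float32 scale_ub tensor, and a (y_fp8, y_scale) pair out), silu_mul_quant fuses y = silu(x0) * x1 with FP8 quantization. A rough eager-mode reference under those assumptions; the row-wise scaling granularity and the scale_ub-as-absmax-cap semantics are guesses, not FBGEMM's implementation:

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Fused activation, written out: y = silu(x0) * x1, in fp32 for accuracy.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Row-wise absmax scaling into the fp8 e4m3 range (assumed granularity).
        amax = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
        if scale_ub is not None:
            # Assumed semantics: scale_ub caps the absmax used for the scale.
            amax = torch.minimum(amax, scale_ub.float())
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        y_scale = amax / fp8_max
        y_fp8 = (y / y_scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

The .to(torch.float8_e4m3fn) cast here is the eager analogue of the fp8e4nv conversion the Triton kernel attempts, which is the exact step rejected on this GPU.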
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:03.9566828Z
2025-05-07T20:32:03.9567251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

[Between 20:32:03.95 and 20:32:04.68 Hypothesis tried nine further examples; each one raised the identical CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"), with a test listing and traceback identical to the one above apart from object addresses. Only the sampled parameters differ:

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)]
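[Editorial sketch, not part of the captured log: every CompilationError above shares one root cause — the kernel asks Triton for the fp8e4nv (FP8 E4M3) dtype on a GPU that only exposes ('fp8e4b15', 'fp8e5'). The compute-capability threshold below is an assumption inferred from the error text (Triton exposes fp8e4nv on Ada/Hopper-class GPUs, i.e. sm_89+, while the g5.4xlarge runner's A10G reports sm_86); it is not something this log states. A guard of roughly this shape would let the suite skip instead of fail:

import unittest

import torch


def fp8e4nv_supported() -> bool:
    """Best-effort check for Triton fp8e4nv (FP8 E4M3) support."""
    if not torch.cuda.is_available():
        return False
    # get_device_capability() returns (major, minor), e.g. (8, 6) on A10G.
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on a test class like the one in moe/activation_test.py:
@unittest.skipIf(
    not fp8e4nv_supported(),
    "fp8e4nv requires sm_89+; this GPU only supports ('fp8e4b15', 'fp8e5')",
)
class SiluMulQuantTests(unittest.TestCase):
    pass
]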
[At 20:32:04.74 the failure mode changes from compile-time errors to CUDA OOM; by this point the process already holds ~21.9 GiB of the device's 22.07 GiB. First OOM example (the @given test listing repeats verbatim and is omitted):

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Three more examples fail the same way, differing only in the failing line and the requested allocation:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at x_clamp (moe/activation_test.py:95), tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at torch.randn (moe/activation_test.py:92), tried to allocate 448.00 MiB
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at x_clamp (moe/activation_test.py:95), tried to allocate 56.00 MiB]
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.7533143Z 2025-05-07T20:32:04.7533261Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:04.7533569Z 2025-05-07T20:32:04.7533676Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.7534092Z self=, 2025-05-07T20:32:04.7534499Z T=2048, 2025-05-07T20:32:04.7534689Z D=7168, 2025-05-07T20:32:04.7534888Z scale_ub=None, 2025-05-07T20:32:04.7535104Z contiguous=True, 2025-05-07T20:32:04.7535332Z compiled=False, 2025-05-07T20:32:04.7535543Z ) 2025-05-07T20:32:04.8401625Z self = 2025-05-07T20:32:04.8402889Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:04.8403641Z 2025-05-07T20:32:04.8403855Z @given( 2025-05-07T20:32:04.8404480Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.8405272Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.8406170Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.8406644Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.8406982Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.8407267Z ) 2025-05-07T20:32:04.8407675Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.8408124Z def test_silu_mul_quant( 2025-05-07T20:32:04.8408374Z self, 2025-05-07T20:32:04.8408568Z T: int, 2025-05-07T20:32:04.8408773Z D: int, 2025-05-07T20:32:04.8408995Z scale_ub: Optional[float], 2025-05-07T20:32:04.8409510Z contiguous: bool, 2025-05-07T20:32:04.8409752Z compiled: bool, 2025-05-07T20:32:04.8409982Z ) -> None: 2025-05-07T20:32:04.8410200Z torch.manual_seed(2025) 2025-05-07T20:32:04.8410449Z 2025-05-07T20:32:04.8410725Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.8411067Z 2025-05-07T20:32:04.8411271Z > x_sign = torch.sign(x) 2025-05-07T20:32:04.8413216Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
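[Note: the hint repeated in every OOM message can be acted on before any CUDA allocation happens in the process. A minimal sketch, assuming module import time in the test file is the chosen place; setting the variable in the workflow's job environment would work equally well:]

    import os

    # The CUDA caching allocator reads this on its first allocation, so it
    # must be set before any tensor lands on the GPU -- i.e. before torch
    # initializes CUDA in this process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # noqa: E402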
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = ...
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': ..., 'min_dot_size': ...}
module_map = {'triton.language.extra.libdevice': ...}
context = ...

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
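[Note: these CompilationError failures are an architecture mismatch, not a flaky build. The kernel requests Triton's fp8e4nv (FP8 E4M3) type, which, as the error says, this GPU does not provide; the A10G backing linux.g5.4xlarge.nvidia.gpu is sm_86, where only fp8e4b15 and fp8e5 are available. A sketch of a capability guard that would turn the hard error into a skip; the helper name and the 8.9 threshold (Ada/Hopper-class support for fp8e4nv) are assumptions, not something this test file defines:]

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Assumed gate: Triton exposes fp8e4nv on compute capability >= 8.9;
        # the sm_86 A10G on this runner reports (8, 6) and fails to compile.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Applied to test_silu_mul_quant above, this skips instead of erroring:
    requires_fp8 = unittest.skipIf(
        not _supports_fp8e4nv(), "fp8e4nv not supported on this GPU architecture"
    )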
[The remaining fp8e4nv failures repeat the traceback above verbatim; only the sampled parameters differ.]

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:92: OutOfMemoryError
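[Note: the "Tried to allocate" sizes match the test's input tensor exactly -- T x 2D bfloat16 elements at 2 bytes apiece -- so each request size is a deterministic function of the sampled (T, D). A quick check against the figures reported in this log:]

    def randn_alloc_mib(T: int, D: int) -> float:
        # x = torch.randn([T, 2 * D], dtype=torch.bfloat16): 2 bytes/element
        return T * (2 * D) * 2 / 2**20

    assert randn_alloc_mib(2048, 7168) == 56.0    # the 56.00 MiB requests
    assert randn_alloc_mib(4096, 7168) == 112.0   # the 112.00 MiB requests
    assert randn_alloc_mib(16384, 7168) == 448.0  # the 448.00 MiB requests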
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:94: OutOfMemoryError
[The next ten examples all fail with OutOfMemoryError on the initial torch.randn allocation (moe/activation_test.py:92) and report an identical pool state: 26.44 MiB free of 22.07 GiB total, 22.04 GiB in use by this process, 21.73 GiB allocated by PyTorch, 19.12 MiB reserved but unallocated.]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)    -> Tried to allocate 320.00 MiB
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)     -> Tried to allocate 80.00 MiB
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)    -> Tried to allocate 40.00 MiB
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)      -> Tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)  -> Tried to allocate 40.00 MiB
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)   -> Tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)    -> Tried to allocate 448.00 MiB
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)     -> Tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)    -> Tried to allocate 448.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)  -> Tried to allocate 448.00 MiB
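[Note: the "allocated by PyTorch" figure creeps from 21.50 GiB up to 21.74 GiB over the run because Hypothesis executes every example inside a single test invocation; unittest's setUp/tearDown wrap the whole batch, so nothing reclaims memory between examples. A sketch of one mitigation -- the helper name is hypothetical, the calls are standard CPython/PyTorch APIs -- meant to be invoked at the top of the test body, right before torch.manual_seed(2025):]

    import gc

    import torch

    def _reclaim_cuda_memory() -> None:
        # tearDown() runs once around ALL Hypothesis examples, not per
        # example, so the test body itself must drop dead tensors and
        # return cached blocks before allocating the next example's input.
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()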
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.1639885Z 2025-05-07T20:32:05.1640010Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.1640274Z 2025-05-07T20:32:05.1640378Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.1640791Z self=, 2025-05-07T20:32:05.1641200Z T=128, 2025-05-07T20:32:05.1641396Z D=5120, 2025-05-07T20:32:05.1641588Z scale_ub=1200.0, 2025-05-07T20:32:05.1641820Z contiguous=False, 2025-05-07T20:32:05.1642051Z compiled=False, 2025-05-07T20:32:05.1642253Z ) 2025-05-07T20:32:05.2665080Z self = 2025-05-07T20:32:05.2665682Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:05.2665965Z 2025-05-07T20:32:05.2666053Z @given( 2025-05-07T20:32:05.2666292Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.2666614Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.2666927Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.2667277Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.2667613Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.2667905Z ) 2025-05-07T20:32:05.2668264Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.2668709Z def test_silu_mul_quant( 2025-05-07T20:32:05.2668961Z self, 2025-05-07T20:32:05.2669165Z T: int, 2025-05-07T20:32:05.2669361Z D: int, 2025-05-07T20:32:05.2669844Z scale_ub: Optional[float], 2025-05-07T20:32:05.2670130Z contiguous: bool, 2025-05-07T20:32:05.2670372Z compiled: bool, 2025-05-07T20:32:05.2670608Z ) -> None: 2025-05-07T20:32:05.2670828Z torch.manual_seed(2025) 2025-05-07T20:32:05.2671075Z 2025-05-07T20:32:05.2671363Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.2671709Z 2025-05-07T20:32:05.2671903Z x_sign = torch.sign(x) 2025-05-07T20:32:05.2672208Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.2672532Z x = x_sign * x_clamp 2025-05-07T20:32:05.2672775Z x0 = x[:, :D] 2025-05-07T20:32:05.2673005Z x1 = x[:, D:] 2025-05-07T20:32:05.2673221Z 2025-05-07T20:32:05.2673409Z if contiguous: 2025-05-07T20:32:05.2673651Z x0 = x0.contiguous() 2025-05-07T20:32:05.2673924Z x1 = x1.contiguous() 2025-05-07T20:32:05.2674176Z 2025-05-07T20:32:05.2674372Z if scale_ub is not None: 2025-05-07T20:32:05.2674663Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.2675008Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.2675319Z ) 2025-05-07T20:32:05.2675524Z else: 2025-05-07T20:32:05.2675744Z scale_ub_tensor = None 2025-05-07T20:32:05.2675997Z 2025-05-07T20:32:05.2676239Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.2676585Z op = silu_mul_quant 2025-05-07T20:32:05.2676948Z if compiled: 2025-05-07T20:32:05.2677208Z op = torch.compile(op) 2025-05-07T20:32:05.2677514Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.2677790Z 2025-05-07T20:32:05.2677997Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.2678163Z 2025-05-07T20:32:05.2678275Z moe/activation_test.py:117: 2025-05-07T20:32:05.2678582Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.2678915Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.2679215Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.2679922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.2680629Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.2681169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.2681861Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.2682619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.2683163Z kernel = self.compile( 2025-05-07T20:32:05.2683706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.2684374Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.2684784Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.2685016Z 2025-05-07T20:32:05.2685228Z self = 2025-05-07T20:32:05.2686324Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.2687820Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a89237e0>} 2025-05-07T20:32:05.2689177Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.2690293Z context = 2025-05-07T20:32:05.2690587Z 2025-05-07T20:32:05.2690756Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.2691287Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.2691758Z module_map=module_map) 2025-05-07T20:32:05.2692122Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.2692483Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.2692756Z E ^ 2025-05-07T20:32:05.2693226Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.2693676Z 2025-05-07T20:32:05.2694095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.2694614Z 2025-05-07T20:32:05.2694721Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.2695146Z self=, 2025-05-07T20:32:05.2695558Z T=2048, 2025-05-07T20:32:05.2695745Z D=7168, 2025-05-07T20:32:05.2695946Z scale_ub=None, 2025-05-07T20:32:05.2696168Z contiguous=False, 2025-05-07T20:32:05.2696396Z compiled=False, 2025-05-07T20:32:05.2696611Z ) 2025-05-07T20:32:05.2696971Z self = 2025-05-07T20:32:05.2697477Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:05.2697808Z 2025-05-07T20:32:05.2697887Z @given( 2025-05-07T20:32:05.2698126Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.2698441Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.2698752Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.2699084Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.2699420Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.2699706Z ) 2025-05-07T20:32:05.2700063Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.2700511Z def test_silu_mul_quant( 2025-05-07T20:32:05.2700756Z self, 2025-05-07T20:32:05.2700959Z T: int, 2025-05-07T20:32:05.2701165Z D: int, 2025-05-07T20:32:05.2701382Z scale_ub: Optional[float], 2025-05-07T20:32:05.2701661Z contiguous: bool, 2025-05-07T20:32:05.2701908Z compiled: bool, 2025-05-07T20:32:05.2702187Z ) -> None: 2025-05-07T20:32:05.2702417Z torch.manual_seed(2025) 2025-05-07T20:32:05.2702666Z 2025-05-07T20:32:05.2702942Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.2705014Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
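Both failure modes are now visible above: the CUDA OOM and Triton's fp8e4nv CompilationError. The latter is consistent with this runner's A10G GPU; as an assumption (not stated in the log), Triton only lowers fp8e4nv (e4m3) on compute capability 8.9 or newer, while the A10G reports 8.6 and thus offers only fp8e4b15 and fp8e5. A hedged sketch of a guard that would skip these cases on unsupported hardware (the helper name is hypothetical):

    import unittest
    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Assumption: e4m3 ("fp8e4nv") needs an sm_89+ (Ada/Hopper-class) GPU.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the test shown in this log:
    #   @unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv not supported on this GPU")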
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.2707319Z 2025-05-07T20:32:05.2707447Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.2707668Z 2025-05-07T20:32:05.2707773Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.2708196Z self=, 2025-05-07T20:32:05.2708596Z T=128, 2025-05-07T20:32:05.2708792Z D=7168, 2025-05-07T20:32:05.2708990Z scale_ub=1200.0, 2025-05-07T20:32:05.2709212Z contiguous=True, 2025-05-07T20:32:05.2709444Z compiled=True, 2025-05-07T20:32:05.2709654Z ) 2025-05-07T20:32:05.3016489Z self = 2025-05-07T20:32:05.3017229Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.3017541Z 2025-05-07T20:32:05.3017663Z @given( 2025-05-07T20:32:05.3017997Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.3018375Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.3018686Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.3019019Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.3019345Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.3019682Z ) 2025-05-07T20:32:05.3020031Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.3020474Z def test_silu_mul_quant( 2025-05-07T20:32:05.3020719Z self, 2025-05-07T20:32:05.3020911Z T: int, 2025-05-07T20:32:05.3021114Z D: int, 2025-05-07T20:32:05.3021335Z scale_ub: Optional[float], 2025-05-07T20:32:05.3021601Z contiguous: bool, 2025-05-07T20:32:05.3021844Z compiled: bool, 2025-05-07T20:32:05.3022081Z ) -> None: 2025-05-07T20:32:05.3022303Z torch.manual_seed(2025) 2025-05-07T20:32:05.3022547Z 2025-05-07T20:32:05.3022820Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.3023161Z 2025-05-07T20:32:05.3023354Z x_sign = torch.sign(x) 2025-05-07T20:32:05.3023645Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.3023958Z x = x_sign * x_clamp 2025-05-07T20:32:05.3024285Z x0 = x[:, :D] 2025-05-07T20:32:05.3024507Z x1 = x[:, D:] 2025-05-07T20:32:05.3024718Z 2025-05-07T20:32:05.3024905Z if contiguous: 2025-05-07T20:32:05.3025142Z x0 = x0.contiguous() 2025-05-07T20:32:05.3025403Z x1 = x1.contiguous() 2025-05-07T20:32:05.3025640Z 2025-05-07T20:32:05.3025837Z if scale_ub is not None: 2025-05-07T20:32:05.3026116Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.3026453Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.3026767Z ) 2025-05-07T20:32:05.3026970Z else: 2025-05-07T20:32:05.3027183Z scale_ub_tensor = None 2025-05-07T20:32:05.3027441Z 2025-05-07T20:32:05.3027681Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.3027999Z op = silu_mul_quant 2025-05-07T20:32:05.3028252Z if compiled: 2025-05-07T20:32:05.3028506Z op = torch.compile(op) 2025-05-07T20:32:05.3028880Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.3029161Z 2025-05-07T20:32:05.3029365Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.3029532Z 2025-05-07T20:32:05.3029639Z moe/activation_test.py:117: 2025-05-07T20:32:05.3029933Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.3030273Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.3030556Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.3031126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.3031686Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.3032350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.3033042Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.3033577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.3034264Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.3034931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.3035470Z kernel = self.compile( 2025-05-07T20:32:05.3036008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.3036747Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.3037150Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.3037381Z 2025-05-07T20:32:05.3037589Z self = 2025-05-07T20:32:05.3038670Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.3040047Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a8786a20>} 2025-05-07T20:32:05.3041389Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.3042417Z context = 2025-05-07T20:32:05.3042704Z 2025-05-07T20:32:05.3042871Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.3043395Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.3043863Z module_map=module_map) 2025-05-07T20:32:05.3044230Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.3044657Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.3044922Z E ^ 2025-05-07T20:32:05.3045388Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.3045836Z 2025-05-07T20:32:05.3046252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.3046765Z 2025-05-07T20:32:05.3046876Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.3047291Z self=, 2025-05-07T20:32:05.3047785Z T=128, 2025-05-07T20:32:05.3047971Z D=7168, 2025-05-07T20:32:05.3048178Z scale_ub=1200.0, 2025-05-07T20:32:05.3048406Z contiguous=True, 2025-05-07T20:32:05.3048630Z compiled=False, 2025-05-07T20:32:05.3048841Z ) 2025-05-07T20:32:05.3049163Z self = 2025-05-07T20:32:05.3049704Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.3049984Z 2025-05-07T20:32:05.3050064Z @given( 2025-05-07T20:32:05.3050306Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.3050624Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.3050928Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.3051261Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.3051596Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.3051885Z ) 2025-05-07T20:32:05.3052237Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.3052681Z def test_silu_mul_quant( 2025-05-07T20:32:05.3052922Z self, 2025-05-07T20:32:05.3053124Z T: int, 2025-05-07T20:32:05.3053331Z D: int, 2025-05-07T20:32:05.3053549Z scale_ub: Optional[float], 2025-05-07T20:32:05.3053826Z contiguous: bool, 2025-05-07T20:32:05.3054071Z compiled: bool, 2025-05-07T20:32:05.3054291Z ) -> None: 2025-05-07T20:32:05.3054515Z torch.manual_seed(2025) 2025-05-07T20:32:05.3054760Z 2025-05-07T20:32:05.3055030Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.3055377Z 2025-05-07T20:32:05.3055576Z x_sign = torch.sign(x) 2025-05-07T20:32:05.3055870Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.3057949Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
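Every OOM message in this run suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A minimal sketch of applying that suggestion; the only assumption is that the variable must be set before the first CUDA allocation (e.g. in the CI job environment or at the top of the test entry point):

    import os
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported afterwards so the caching allocator reads the setting

Whether this would rescue this particular run is unclear: the messages show 21.7+ GiB actually allocated by PyTorch and only a few MiB reserved-but-unallocated, so the pressure here is real allocation, not fragmentation.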
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.3059806Z 2025-05-07T20:32:05.3059928Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:05.3060145Z 2025-05-07T20:32:05.3060250Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.3060667Z self=, 2025-05-07T20:32:05.3061070Z T=128, 2025-05-07T20:32:05.3061260Z D=5120, 2025-05-07T20:32:05.3061458Z scale_ub=1200.0, 2025-05-07T20:32:05.3061686Z contiguous=True, 2025-05-07T20:32:05.3061909Z compiled=True, 2025-05-07T20:32:05.3062121Z ) 2025-05-07T20:32:05.3062442Z self = 2025-05-07T20:32:05.3062923Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.3063194Z 2025-05-07T20:32:05.3063276Z @given( 2025-05-07T20:32:05.3063510Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.3063819Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.3065526Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.3065860Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.3066186Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.3066481Z ) 2025-05-07T20:32:05.3066843Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.3067288Z def test_silu_mul_quant( 2025-05-07T20:32:05.3067530Z self, 2025-05-07T20:32:05.3067739Z T: int, 2025-05-07T20:32:05.3067946Z D: int, 2025-05-07T20:32:05.3068162Z scale_ub: Optional[float], 2025-05-07T20:32:05.3068440Z contiguous: bool, 2025-05-07T20:32:05.3068685Z compiled: bool, 2025-05-07T20:32:05.3068907Z ) -> None: 2025-05-07T20:32:05.3069129Z torch.manual_seed(2025) 2025-05-07T20:32:05.3078275Z 2025-05-07T20:32:05.3078574Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.3079016Z 2025-05-07T20:32:05.3079225Z x_sign = torch.sign(x) 2025-05-07T20:32:05.3079527Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.3081538Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
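By this point only 4.44 MiB is free, so even 20 MiB requests fail: each retried Hypothesis example inherits the memory pressure left behind by earlier ones. A sketch of a per-example cleanup that could run between examples (an assumption about a remedy, not something the log confirms):

    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()               # drop dead Python references from the last example
        torch.cuda.empty_cache()   # return cached, unused blocks to the driver
        torch.cuda.synchronize()   # wait for pending frees to complete

Called from the test's tearDown, this would keep one example's tensors from starving the next.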
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.3083393Z 2025-05-07T20:32:05.3083518Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:05.3083744Z 2025-05-07T20:32:05.3083852Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.3084278Z self=, 2025-05-07T20:32:05.3084689Z T=128, 2025-05-07T20:32:05.3084891Z D=7168, 2025-05-07T20:32:05.3085095Z scale_ub=None, 2025-05-07T20:32:05.3085314Z contiguous=True, 2025-05-07T20:32:05.3085550Z compiled=True, 2025-05-07T20:32:05.3085767Z ) 2025-05-07T20:32:05.7406934Z self = 2025-05-07T20:32:05.7407630Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.7408073Z 2025-05-07T20:32:05.7408168Z @given( 2025-05-07T20:32:05.7408400Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.7408715Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.7409018Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.7409345Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.7409682Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.7409981Z ) 2025-05-07T20:32:05.7410341Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.7410783Z def test_silu_mul_quant( 2025-05-07T20:32:05.7411038Z self, 2025-05-07T20:32:05.7411250Z T: int, 2025-05-07T20:32:05.7411453Z D: int, 2025-05-07T20:32:05.7411682Z scale_ub: Optional[float], 2025-05-07T20:32:05.7411962Z contiguous: bool, 2025-05-07T20:32:05.7412204Z compiled: bool, 2025-05-07T20:32:05.7412442Z ) -> None: 2025-05-07T20:32:05.7412678Z torch.manual_seed(2025) 2025-05-07T20:32:05.7412923Z 2025-05-07T20:32:05.7413204Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.7415262Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
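For context, the test body repeated throughout this log compares silu_mul_quant against a ref_fn that computes y = x0 * sigmoid(x0) * x1 in fp32 and then quantizes row-wise with triton_quantize_fp8_row (see the tracebacks in the failure summary below). A pure-PyTorch sketch of that row-wise quantization, under the assumption that the scale is each row's max magnitude (optionally clamped to scale_ub) divided by the fp8 e4m3 maximum:

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

    def rowwise_quantize_fp8(y, scale_ub=None):
        row_max = y.abs().amax(dim=1)                   # per-row max magnitude
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # apply the upper bound
        scale = row_max.clamp(min=1e-12) / FP8_MAX      # avoid divide-by-zero
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Dequantizing as y_fp8.to(torch.float32) * scale[:, None] then mirrors the comparison done in the test.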
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.7417256Z 2025-05-07T20:32:05.7417380Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.7417595Z 2025-05-07T20:32:05.7449834Z FAILED 2025-05-07T20:32:05.7450011Z 2025-05-07T20:32:05.7450201Z =================================== FAILURES =================================== 2025-05-07T20:32:05.7450668Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:05.7451223Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:05.7451975Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 57, in testPartExecutor 2025-05-07T20:32:05.7452686Z | yield 2025-05-07T20:32:05.7453149Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 623, in run 2025-05-07T20:32:05.7453902Z | self._callTestMethod(testMethod) 2025-05-07T20:32:05.7454581Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 579, in _callTestMethod 2025-05-07T20:32:05.7455144Z | if method() is not None: 2025-05-07T20:32:05.7455407Z | ^^^^^^^^ 2025-05-07T20:32:05.7456090Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:05.7456815Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.7457125Z | ^^^^^^^ 2025-05-07T20:32:05.7457705Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:05.7458358Z | raise the_error_hypothesis_found 2025-05-07T20:32:05.7458789Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:05.7459228Z +-+---------------- 1 ---------------- 2025-05-07T20:32:05.7459545Z | Traceback (most recent call last): 2025-05-07T20:32:05.7460266Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:05.7461058Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.7461548Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.7463622Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.7465802Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:05.7466261Z | self=, 2025-05-07T20:32:05.7466688Z | T=2048, 2025-05-07T20:32:05.7467000Z | D=5120, # or any other generated value 2025-05-07T20:32:05.7467362Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:05.7467727Z | contiguous=True, # or any other generated value 2025-05-07T20:32:05.7468105Z | compiled=False, # or any other generated value 2025-05-07T20:32:05.7468405Z | ) 2025-05-07T20:32:05.7468591Z | 2025-05-07T20:32:05.7469137Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:05.7469807Z +---------------- 2 ---------------- 2025-05-07T20:32:05.7470098Z | Traceback (most recent call last): 2025-05-07T20:32:05.7470814Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:05.7471590Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.7471964Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.7474024Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.7476045Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:05.7476560Z | self=, 2025-05-07T20:32:05.7476990Z | T=128, 2025-05-07T20:32:05.7477202Z | D=7168, 2025-05-07T20:32:05.7477422Z | scale_ub=None, 2025-05-07T20:32:05.7477674Z | contiguous=True, 2025-05-07T20:32:05.7477923Z | compiled=True, 2025-05-07T20:32:05.7478163Z | ) 2025-05-07T20:32:05.7478349Z | 2025-05-07T20:32:05.7478954Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:05.7479598Z +---------------- 3 ---------------- 2025-05-07T20:32:05.7479892Z | Traceback (most recent call last): 2025-05-07T20:32:05.7480613Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:05.7481417Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.7481810Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.7483878Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
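Each falsifying example above comes with a reproduce_failure blob. A sketch of replaying failure 1 exactly (the decorator is temporary, and the blob is tied to Hypothesis 6.131.14 as the message says):

    from hypothesis import reproduce_failure

    # Added on top of the existing @given/@settings decorators, then removed
    # again after debugging:
    #
    #   @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
    #   def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled):
    #       ...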
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.7485875Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:05.7486324Z | self=, 2025-05-07T20:32:05.7486744Z | T=128, 2025-05-07T20:32:05.7486951Z | D=5120, 2025-05-07T20:32:05.7487170Z | scale_ub=1200.0, 2025-05-07T20:32:05.7487414Z | contiguous=True, 2025-05-07T20:32:05.7487776Z | compiled=True, 2025-05-07T20:32:05.7488006Z | ) 2025-05-07T20:32:05.7488188Z | 2025-05-07T20:32:05.7488732Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:05.7489348Z +---------------- 4 ---------------- 2025-05-07T20:32:05.7489689Z | Traceback (most recent call last): 2025-05-07T20:32:05.7490412Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:05.7491249Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:05.7491558Z | ^^^^^^^^ 2025-05-07T20:32:05.7492278Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:05.7492974Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.7493320Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.7494126Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:05.7494922Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.7495531Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:05.7496260Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.7496716Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.7497454Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:05.7498225Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.7498698Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.7499380Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:05.7500185Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.7500657Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.7501299Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:05.7502014Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.7502392Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.7502993Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:05.7503565Z | fn() 2025-05-07T20:32:05.7504258Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:05.7504891Z | self.fn.run( 2025-05-07T20:32:05.7505438Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:05.7506314Z | kernel = self.compile( 2025-05-07T20:32:05.7506585Z | ^^^^^^^^^^^^^ 2025-05-07T20:32:05.7507184Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:05.7507902Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.7508290Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.7508929Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.7509724Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.7510217Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.7510603Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.7510957Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:05.7511225Z | ^ 2025-05-07T20:32:05.7511688Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.7512350Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:05.7512750Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:05.7513270Z | self=, 2025-05-07T20:32:05.7513705Z | T=1, # or any other generated value 2025-05-07T20:32:05.7514024Z | D=5120, # or any other generated value 2025-05-07T20:32:05.7514367Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:05.7514743Z | contiguous=True, # or any other generated value 2025-05-07T20:32:05.7515103Z | compiled=True, # or any other generated value 2025-05-07T20:32:05.7515416Z | ) 2025-05-07T20:32:05.7515603Z | 2025-05-07T20:32:05.7516126Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:05.7516739Z +------------------------------------ 2025-05-07T20:32:05.7517186Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:05.7517572Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.7517982Z self=, 2025-05-07T20:32:05.7518391Z T=1, 2025-05-07T20:32:05.7518586Z D=5120, 2025-05-07T20:32:05.7518786Z scale_ub=None, 2025-05-07T20:32:05.7519014Z contiguous=True, 2025-05-07T20:32:05.7519246Z compiled=True, 2025-05-07T20:32:05.7519454Z ) 2025-05-07T20:32:05.7519789Z self = 2025-05-07T20:32:05.7520275Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.7520536Z 2025-05-07T20:32:05.7520627Z @given( 2025-05-07T20:32:05.7520864Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.7521189Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.7521500Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.7521840Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.7522174Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.7522472Z ) 2025-05-07T20:32:05.7522819Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.7523272Z def test_silu_mul_quant( 2025-05-07T20:32:05.7523522Z self, 2025-05-07T20:32:05.7523716Z T: int, 2025-05-07T20:32:05.7523925Z D: int, 2025-05-07T20:32:05.7524278Z scale_ub: Optional[float], 2025-05-07T20:32:05.7524567Z contiguous: 
bool, 2025-05-07T20:32:05.7524812Z compiled: bool, 2025-05-07T20:32:05.7525050Z ) -> None: 2025-05-07T20:32:05.7525277Z torch.manual_seed(2025) 2025-05-07T20:32:05.7525521Z 2025-05-07T20:32:05.7525799Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.7526157Z 2025-05-07T20:32:05.7526358Z x_sign = torch.sign(x) 2025-05-07T20:32:05.7526667Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.7526987Z x = x_sign * x_clamp 2025-05-07T20:32:05.7527233Z x0 = x[:, :D] 2025-05-07T20:32:05.7527463Z x1 = x[:, D:] 2025-05-07T20:32:05.7527776Z 2025-05-07T20:32:05.7527966Z if contiguous: 2025-05-07T20:32:05.7528208Z x0 = x0.contiguous() 2025-05-07T20:32:05.7528478Z x1 = x1.contiguous() 2025-05-07T20:32:05.7528722Z 2025-05-07T20:32:05.7528920Z if scale_ub is not None: 2025-05-07T20:32:05.7529213Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.7529549Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.7529868Z ) 2025-05-07T20:32:05.7530068Z else: 2025-05-07T20:32:05.7530290Z scale_ub_tensor = None 2025-05-07T20:32:05.7530539Z 2025-05-07T20:32:05.7530779Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.7531097Z op = silu_mul_quant 2025-05-07T20:32:05.7531402Z if compiled: 2025-05-07T20:32:05.7531658Z op = torch.compile(op) 2025-05-07T20:32:05.7531964Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.7532246Z 2025-05-07T20:32:05.7532447Z y_fp8, y_scale = fn() 2025-05-07T20:32:05.7532747Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:05.7533040Z 2025-05-07T20:32:05.7533289Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.7533636Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:05.7533934Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:05.7534256Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:05.7534631Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.7534946Z 2025-05-07T20:32:05.7535156Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:05.7535360Z 2025-05-07T20:32:05.7535462Z moe/activation_test.py:126: 2025-05-07T20:32:05.7535829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7536170Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:05.7536509Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.7537304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:05.7538058Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.7538608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.7539299Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.7539988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:05.7540711Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.7541466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:05.7542221Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.7542947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:05.7543588Z return 
self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.7544276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:05.7544810Z fn() 2025-05-07T20:32:05.7545328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:05.7545904Z self.fn.run( 2025-05-07T20:32:05.7546385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.7546925Z kernel = self.compile( 2025-05-07T20:32:05.7547469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.7548132Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.7548530Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7548762Z 2025-05-07T20:32:05.7548978Z self = 2025-05-07T20:32:05.7550061Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.7551445Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a97697ce0>} 2025-05-07T20:32:05.7552842Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.7553873Z context = 2025-05-07T20:32:05.7554161Z 2025-05-07T20:32:05.7554335Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.7554860Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.7555334Z module_map=module_map) 2025-05-07T20:32:05.7555704Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.7556060Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:05.7556332Z E ^ 2025-05-07T20:32:05.7556926Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.7557613Z 2025-05-07T20:32:05.7558052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.7558568Z 2025-05-07T20:32:05.7558675Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.7559098Z self=, 2025-05-07T20:32:05.7559506Z T=2048, 2025-05-07T20:32:05.7559701Z D=5120, 2025-05-07T20:32:05.7559904Z scale_ub=1200.0, 2025-05-07T20:32:05.7560142Z contiguous=True, 2025-05-07T20:32:05.7560369Z compiled=False, 2025-05-07T20:32:05.7560591Z ) 2025-05-07T20:32:05.7560913Z self = 2025-05-07T20:32:05.7561421Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.7561695Z 2025-05-07T20:32:05.7561776Z @given( 2025-05-07T20:32:05.7562019Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.7562344Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.7562659Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.7562995Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.7563332Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.7563614Z ) 2025-05-07T20:32:05.7563964Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.7564407Z def test_silu_mul_quant( 2025-05-07T20:32:05.7564654Z self, 2025-05-07T20:32:05.7564946Z T: int, 2025-05-07T20:32:05.7565150Z D: int, 2025-05-07T20:32:05.7565375Z scale_ub: Optional[float], 2025-05-07T20:32:05.7565645Z contiguous: bool, 2025-05-07T20:32:05.7565893Z compiled: bool, 2025-05-07T20:32:05.7566126Z ) -> None: 2025-05-07T20:32:05.7566340Z torch.manual_seed(2025) 2025-05-07T20:32:05.7566618Z 2025-05-07T20:32:05.7566923Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.7567277Z 2025-05-07T20:32:05.7567487Z x_sign = torch.sign(x) 2025-05-07T20:32:05.7567909Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.7568223Z x = x_sign * x_clamp 2025-05-07T20:32:05.7568471Z x0 = x[:, :D] 2025-05-07T20:32:05.7568696Z x1 = x[:, D:] 2025-05-07T20:32:05.7568909Z 2025-05-07T20:32:05.7569100Z if contiguous: 2025-05-07T20:32:05.7569341Z x0 = x0.contiguous() 2025-05-07T20:32:05.7569608Z x1 = x1.contiguous() 2025-05-07T20:32:05.7569858Z 2025-05-07T20:32:05.7570062Z if scale_ub is not None: 2025-05-07T20:32:05.7570337Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.7570680Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.7571003Z ) 2025-05-07T20:32:05.7571210Z else: 2025-05-07T20:32:05.7571422Z scale_ub_tensor = None 2025-05-07T20:32:05.7571685Z 2025-05-07T20:32:05.7571983Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.7572302Z op = silu_mul_quant 2025-05-07T20:32:05.7572558Z if compiled: 2025-05-07T20:32:05.7572812Z op = torch.compile(op) 2025-05-07T20:32:05.7573110Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.7573396Z 2025-05-07T20:32:05.7573595Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.7573763Z 2025-05-07T20:32:05.7573865Z moe/activation_test.py:117: 2025-05-07T20:32:05.7574178Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7595238Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.7595684Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.7596676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.7597702Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.7598449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.7599572Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.7600506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.7601268Z kernel = self.compile( 2025-05-07T20:32:05.7602035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.7602936Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.7603474Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7603797Z 2025-05-07T20:32:05.7604078Z self = 2025-05-07T20:32:05.7605591Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.7607897Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a97911da0>} 2025-05-07T20:32:05.7609984Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.7611336Z context = 2025-05-07T20:32:05.7611720Z 2025-05-07T20:32:05.7611934Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.7612674Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.7613335Z module_map=module_map) 2025-05-07T20:32:05.7613859Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.7614355Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.7614711Z E ^ 2025-05-07T20:32:05.7615363Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.7616006Z 2025-05-07T20:32:05.7616590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.7617364Z 2025-05-07T20:32:05.7617519Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.7618084Z self=, 2025-05-07T20:32:05.7618647Z T=2048, 2025-05-07T20:32:05.7618906Z D=5120, 2025-05-07T20:32:05.7619169Z scale_ub=1200.0, 2025-05-07T20:32:05.7619486Z contiguous=True, 2025-05-07T20:32:05.7619798Z compiled=True, 2025-05-07T20:32:05.7620091Z ) 2025-05-07T20:32:05.7620638Z self = 2025-05-07T20:32:05.7621330Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.7621703Z 2025-05-07T20:32:05.7621835Z @given( 2025-05-07T20:32:05.7622143Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.7622555Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.7622953Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.7623391Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.7623831Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.7624207Z ) 2025-05-07T20:32:05.7624656Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.7625225Z def test_silu_mul_quant( 2025-05-07T20:32:05.7625539Z self, 2025-05-07T20:32:05.7625793Z T: int, 2025-05-07T20:32:05.7626050Z D: int, 2025-05-07T20:32:05.7626354Z scale_ub: Optional[float], 2025-05-07T20:32:05.7626920Z contiguous: bool, 2025-05-07T20:32:05.7627254Z compiled: bool, 2025-05-07T20:32:05.7627587Z ) -> None: 2025-05-07T20:32:05.7627897Z torch.manual_seed(2025) 2025-05-07T20:32:05.7628227Z 2025-05-07T20:32:05.7628611Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.7629099Z 2025-05-07T20:32:05.7629359Z x_sign = torch.sign(x) 2025-05-07T20:32:05.7629761Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.7630187Z x = x_sign * x_clamp 2025-05-07T20:32:05.7630521Z x0 = x[:, :D] 2025-05-07T20:32:05.7630838Z x1 = x[:, D:] 2025-05-07T20:32:05.7631133Z 2025-05-07T20:32:05.7631390Z if contiguous: 2025-05-07T20:32:05.7631721Z x0 = x0.contiguous() 2025-05-07T20:32:05.7632088Z x1 = x1.contiguous() 2025-05-07T20:32:05.7632436Z 2025-05-07T20:32:05.7632706Z if scale_ub is not None: 2025-05-07T20:32:05.7633108Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.7633583Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.7634012Z ) 2025-05-07T20:32:05.7634290Z else: 2025-05-07T20:32:05.7634591Z scale_ub_tensor = None 2025-05-07T20:32:05.7634925Z 2025-05-07T20:32:05.7635247Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.7635679Z op = silu_mul_quant 2025-05-07T20:32:05.7636020Z if compiled: 2025-05-07T20:32:05.7636474Z op = torch.compile(op) 2025-05-07T20:32:05.7636889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.7637274Z 2025-05-07T20:32:05.7637553Z y_fp8, y_scale = fn() 2025-05-07T20:32:05.7637949Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:05.7638346Z 2025-05-07T20:32:05.7638672Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.7639141Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:05.7639566Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:05.7639975Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:05.7640450Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.7640835Z 2025-05-07T20:32:05.7641079Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:05.7641331Z 2025-05-07T20:32:05.7641453Z moe/activation_test.py:126: 2025-05-07T20:32:05.7641827Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7642236Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:05.7642660Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.7643774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:05.7644798Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.7645651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.7646629Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.7647708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:05.7648738Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.7651011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:05.7652069Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.7653091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:05.7653991Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.7654915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:05.7655658Z fn() 2025-05-07T20:32:05.7656390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:05.7657240Z self.fn.run( 2025-05-07T20:32:05.7657902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.7658646Z kernel = self.compile( 2025-05-07T20:32:05.7659411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.7660328Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.7660883Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7661207Z 2025-05-07T20:32:05.7661490Z self = 2025-05-07T20:32:05.7663011Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.7664994Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f3a9642f2e0>} 2025-05-07T20:32:05.7667007Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.7668479Z context = 2025-05-07T20:32:05.7668906Z 2025-05-07T20:32:05.7669144Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.7669886Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.7670538Z module_map=module_map) 2025-05-07T20:32:05.7671021Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.7671512Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:05.7671893Z E ^ 2025-05-07T20:32:05.7672541Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.7673187Z 2025-05-07T20:32:05.7673780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.7674513Z 2025-05-07T20:32:05.7674661Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.7675243Z self=, 2025-05-07T20:32:05.7675805Z T=16384, 2025-05-07T20:32:05.7676090Z D=7168, 2025-05-07T20:32:05.7676370Z scale_ub=1200.0, 2025-05-07T20:32:05.7676687Z contiguous=False, 2025-05-07T20:32:05.7677073Z compiled=False, 2025-05-07T20:32:05.7677367Z ) 2025-05-07T20:32:05.7677808Z self = 2025-05-07T20:32:05.7678511Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:05.7678914Z 2025-05-07T20:32:05.7679027Z @given( 2025-05-07T20:32:05.7679356Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.7679798Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.7680247Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.7680725Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.7681182Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.7681589Z ) 2025-05-07T20:32:05.7682089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.7682705Z def test_silu_mul_quant( 2025-05-07T20:32:05.7683052Z self, 2025-05-07T20:32:05.7683394Z T: int, 2025-05-07T20:32:05.7683674Z D: int, 2025-05-07T20:32:05.7683988Z scale_ub: Optional[float], 2025-05-07T20:32:05.7684375Z contiguous: bool, 2025-05-07T20:32:05.7684723Z compiled: bool, 2025-05-07T20:32:05.7685029Z ) -> None: 2025-05-07T20:32:05.7685332Z torch.manual_seed(2025) 2025-05-07T20:32:05.7685681Z 2025-05-07T20:32:05.7686048Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.7686527Z 2025-05-07T20:32:05.7686840Z x_sign = torch.sign(x) 2025-05-07T20:32:05.7687251Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.7687769Z x = x_sign * x_clamp 2025-05-07T20:32:05.7688092Z x0 = x[:, :D] 2025-05-07T20:32:05.7688396Z x1 = x[:, D:] 2025-05-07T20:32:05.7688701Z 2025-05-07T20:32:05.7688969Z if contiguous: 2025-05-07T20:32:05.7689291Z x0 = x0.contiguous() 2025-05-07T20:32:05.7689661Z x1 = x1.contiguous() 2025-05-07T20:32:05.7690014Z 2025-05-07T20:32:05.7690279Z if scale_ub is not None: 2025-05-07T20:32:05.7690662Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.7691137Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.7691591Z ) 2025-05-07T20:32:05.7691869Z else: 2025-05-07T20:32:05.7692172Z scale_ub_tensor = None 2025-05-07T20:32:05.7692537Z 2025-05-07T20:32:05.7693007Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.7693462Z op = silu_mul_quant 2025-05-07T20:32:05.7693826Z if compiled: 
        op = torch.compile(op)
    return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': ..., 'min_dot_size': ...}
module_map = {'triton.language.extra.libdevice': ...}
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
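Every Hypothesis example fails in the same way: Triton rejects fp8e4nv (its name for the e4m3 fp8 format, torch.float8_e4m3fn) while lowering both _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row. Triton only lowers fp8e4nv on NVIDIA GPUs with compute capability 8.9 or newer, and this job runs on a linux.g5.4xlarge runner whose A10G GPU is sm_86, so the kernels can never compile regardless of the example parameters. A minimal sketch of a capability guard that would skip these tests on unsupported hardware follows; the helper name and class placement are assumptions for illustration, not FBGEMM's actual test setup.

# Hypothetical guard, not part of moe/activation_test.py: skip fp8 tests on GPUs
# that cannot compile fp8e4nv kernels.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) is only lowered by Triton on NVIDIA GPUs
    # with compute capability >= 8.9; the A10G on this runner is sm_86.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
class ActivationFp8Tests(unittest.TestCase):
    ...  # test_silu_mul_quant would live here

With such a guard the remaining examples below would be reported as skips rather than repeating the identical CompilationError for each parameter combination.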
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
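For reference, the quantization that ref_fn delegates to triton_quantize_fp8_row is row-wise fp8 scaling: one dequantization scale per row, chosen so that the row fits the e4m3 range, with scale_ub (exercised by the examples below) capping the per-row max. A minimal eager-mode sketch of those semantics, assuming float32 input; the function name, eps, and the exact scale_ub handling are assumptions for illustration, not fbgemm_gpu's implementation:

# Hypothetical eager-mode sketch of row-wise fp8 quantization. It mirrors the
# contract the test checks (y ~= y_fp8.to(torch.float32) * y_scale[:, None]),
# not FBGEMM's Triton kernel.
from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3


def quantize_fp8_row_ref(
    y: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
    eps: float = 1e-12,
) -> Tuple[torch.Tensor, torch.Tensor]:
    row_max = y.abs().amax(dim=-1)
    if scale_ub is not None:
        # Assumed semantics: cap outlier rows at the provided upper bound.
        row_max = torch.clamp(row_max, max=scale_ub.item())
    scale = torch.clamp(row_max, min=eps) / FP8_MAX  # per-row dequant scale
    y_fp8 = (y / scale[:, None]).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_fp8, scale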
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:05.7905296Z ) 2025-05-07T20:32:05.7905376Z else: 2025-05-07T20:32:05.7905472Z scale_ub_tensor = None 2025-05-07T20:32:05.7905554Z 2025-05-07T20:32:05.7905908Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.7906043Z op = silu_mul_quant 2025-05-07T20:32:05.7906146Z if compiled: 2025-05-07T20:32:05.7906251Z op = torch.compile(op) 2025-05-07T20:32:05.7906363Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.7906438Z 2025-05-07T20:32:05.7906530Z y_fp8, y_scale = fn() 2025-05-07T20:32:05.7906659Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:05.7906730Z 2025-05-07T20:32:05.7906865Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.7906973Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:05.7907078Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:05.7907199Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:05.7907344Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.7907419Z 2025-05-07T20:32:05.7907525Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:05.7907530Z 2025-05-07T20:32:05.7907627Z moe/activation_test.py:126: 2025-05-07T20:32:05.7907756Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7907955Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:05.7908091Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.7908649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:05.7908760Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.7909124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.7909354Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.7909721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:05.7922183Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.7922627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:05.7923016Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.7923398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:05.7923565Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.7923914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:05.7923992Z fn() 2025-05-07T20:32:05.7924390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:05.7924480Z self.fn.run( 2025-05-07T20:32:05.7924816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.7924911Z kernel = self.compile( 2025-05-07T20:32:05.7925293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.7925466Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.7925604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
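For context: Triton's fp8e4nv is the FP8 E4M3 ("e4m3fn") format, which Triton lowers only on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper); on older architectures it offers only 'fp8e4b15' and 'fp8e5', which is exactly what the ValueError reports. A minimal sketch of a capability gate a caller or test could use to avoid compiling these kernels on unsupported devices (the helper name is illustrative, not FBGEMM or Triton API):

    import torch

    def supports_fp8_e4m3() -> bool:
        # Illustrative check: Triton lowers fp8e4nv (FP8 E4M3) only on
        # NVIDIA GPUs with compute capability >= (8, 9), i.e. Ada/Hopper.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

Such a predicate could feed pytest.mark.skipif (or unittest.skipIf) so the test never draws Hypothesis examples on hardware that cannot compile the kernel.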
2025-05-07T20:32:05.7930105Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)

T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    (test source identical to the listing above)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
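A note on the "Trying example:" lines: they come from Hypothesis, because the test runs under @settings(verbosity=Verbosity.verbose); each line is one draw from the st.sampled_from grids declared in @given, so the section below is simply every drawn combination hitting the same compile failure. A self-contained sketch of the same pattern, independent of this test suite:

    import hypothesis.strategies as st
    from hypothesis import Verbosity, given, settings

    @given(
        T=st.sampled_from([1, 128, 2048]),
        contiguous=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=6, deadline=None)
    def test_demo(T: int, contiguous: bool) -> None:
        # Under Verbosity.verbose, Hypothesis prints "Trying example: ..."
        # for every draw -- the same lines that fill this log.
        assert T >= 1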
y_scale_ref = ref_fn() 2025-05-07T20:32:05.7949793Z 2025-05-07T20:32:05.7949969Z moe/activation_test.py:126: 2025-05-07T20:32:05.7950101Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7950203Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:05.7950336Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.7950895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:05.7950998Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.7951358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.7951578Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.7951942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:05.7952201Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.7952594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:05.7952844Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.7953214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:05.7953420Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.7953758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:05.7953834Z fn() 2025-05-07T20:32:05.7954227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:05.7954312Z self.fn.run( 2025-05-07T20:32:05.7954648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.7954739Z kernel = self.compile( 2025-05-07T20:32:05.7955125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.7955301Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.7955433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7955487Z 2025-05-07T20:32:05.7955689Z self = 2025-05-07T20:32:05.7956460Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.7957015Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f3a8455d4e0>} 2025-05-07T20:32:05.7957771Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.7957960Z context = 2025-05-07T20:32:05.7957965Z 2025-05-07T20:32:05.7958126Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.7958398Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.7958502Z module_map=module_map) 2025-05-07T20:32:05.7958664Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.7958762Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:05.7958837Z E ^ 2025-05-07T20:32:05.7959292Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.7959297Z 2025-05-07T20:32:05.7959708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.7959713Z 2025-05-07T20:32:05.7959816Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.7960032Z self=, 2025-05-07T20:32:05.7960108Z T=1, 2025-05-07T20:32:05.7960189Z D=5120, 2025-05-07T20:32:05.7960269Z scale_ub=None, 2025-05-07T20:32:05.7960353Z contiguous=True, 2025-05-07T20:32:05.7960438Z compiled=False, 2025-05-07T20:32:05.7960507Z ) 2025-05-07T20:32:05.7960721Z self = 2025-05-07T20:32:05.7960888Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:05.7960893Z 2025-05-07T20:32:05.7960968Z @given( 2025-05-07T20:32:05.7961095Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.7961193Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.7961306Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.7961426Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.7961534Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.7961609Z ) 2025-05-07T20:32:05.7961853Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.7961989Z def test_silu_mul_quant( 2025-05-07T20:32:05.7962064Z self, 2025-05-07T20:32:05.7962145Z T: int, 2025-05-07T20:32:05.7962221Z D: int, 2025-05-07T20:32:05.7962319Z scale_ub: Optional[float], 2025-05-07T20:32:05.7962405Z contiguous: bool, 2025-05-07T20:32:05.7962488Z compiled: bool, 2025-05-07T20:32:05.7962568Z ) -> None: 2025-05-07T20:32:05.7962659Z torch.manual_seed(2025) 2025-05-07T20:32:05.7962729Z 2025-05-07T20:32:05.7962906Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.7962982Z 2025-05-07T20:32:05.7963073Z x_sign = torch.sign(x) 2025-05-07T20:32:05.7963204Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.7963291Z x = x_sign * x_clamp 2025-05-07T20:32:05.7963372Z x0 = x[:, :D] 2025-05-07T20:32:05.7963457Z x1 = x[:, D:] 2025-05-07T20:32:05.7963529Z 2025-05-07T20:32:05.7963612Z if contiguous: 2025-05-07T20:32:05.7963747Z x0 = x0.contiguous() 2025-05-07T20:32:05.7963832Z x1 = x1.contiguous() 2025-05-07T20:32:05.7963909Z 2025-05-07T20:32:05.7963997Z if scale_ub is not None: 2025-05-07T20:32:05.7964103Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.7964241Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.7964311Z ) 2025-05-07T20:32:05.7964386Z else: 2025-05-07T20:32:05.7964484Z scale_ub_tensor = None 2025-05-07T20:32:05.7964561Z 2025-05-07T20:32:05.7964688Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.7964781Z op = silu_mul_quant 2025-05-07T20:32:05.7964867Z if compiled: 2025-05-07T20:32:05.7964963Z 
op = torch.compile(op) 2025-05-07T20:32:05.7965070Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.7965140Z 2025-05-07T20:32:05.7965232Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.7965242Z 2025-05-07T20:32:05.7965335Z moe/activation_test.py:117: 2025-05-07T20:32:05.7965461Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7965566Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.7965662Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.7966158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.7966258Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.7966691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.7966936Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.7967302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.7967395Z kernel = self.compile( 2025-05-07T20:32:05.7967829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.7968004Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.7968130Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7968138Z 2025-05-07T20:32:05.7968339Z self = 2025-05-07T20:32:05.7969121Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.7969627Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a8455cea0>} 2025-05-07T20:32:05.7970367Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.7970606Z context = 2025-05-07T20:32:05.7970610Z 2025-05-07T20:32:05.7970771Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.7971031Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.7971144Z module_map=module_map) 2025-05-07T20:32:05.7971303Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.7971406Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.7971481Z E ^ 2025-05-07T20:32:05.7971833Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.7971838Z 2025-05-07T20:32:05.7972249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.7972298Z 2025-05-07T20:32:05.7972400Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.7972617Z self=, 2025-05-07T20:32:05.7972702Z T=128, 2025-05-07T20:32:05.7972778Z D=5120, 2025-05-07T20:32:05.7972863Z scale_ub=None, 2025-05-07T20:32:05.7972946Z contiguous=False, 2025-05-07T20:32:05.7973024Z compiled=True, 2025-05-07T20:32:05.7973100Z ) 2025-05-07T20:32:05.7973318Z self = 2025-05-07T20:32:05.7973488Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:05.7973493Z 2025-05-07T20:32:05.7973574Z @given( 2025-05-07T20:32:05.7973692Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.7973790Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.7973912Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.7974031Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.7974144Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.7974213Z ) 2025-05-07T20:32:05.7974452Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.7974546Z def test_silu_mul_quant( 2025-05-07T20:32:05.7974620Z self, 2025-05-07T20:32:05.7974695Z T: int, 2025-05-07T20:32:05.7974778Z D: int, 2025-05-07T20:32:05.7974956Z scale_ub: Optional[float], 2025-05-07T20:32:05.7975045Z contiguous: bool, 2025-05-07T20:32:05.7975133Z compiled: bool, 2025-05-07T20:32:05.7975211Z ) -> None: 2025-05-07T20:32:05.7975302Z torch.manual_seed(2025) 2025-05-07T20:32:05.7975377Z 2025-05-07T20:32:05.7975543Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.7975622Z 2025-05-07T20:32:05.7975712Z x_sign = torch.sign(x) 2025-05-07T20:32:05.7975838Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.7975928Z x = x_sign * x_clamp 2025-05-07T20:32:05.7976006Z x0 = x[:, :D] 2025-05-07T20:32:05.7976084Z x1 = x[:, D:] 2025-05-07T20:32:05.7976165Z 2025-05-07T20:32:05.7976246Z if contiguous: 2025-05-07T20:32:05.7976335Z x0 = x0.contiguous() 2025-05-07T20:32:05.7976428Z x1 = x1.contiguous() 2025-05-07T20:32:05.7976498Z 2025-05-07T20:32:05.7976593Z if scale_ub is not None: 2025-05-07T20:32:05.7976703Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.7976860Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.7976944Z ) 2025-05-07T20:32:05.7977036Z else: 2025-05-07T20:32:05.7977128Z scale_ub_tensor = None 2025-05-07T20:32:05.7977198Z 2025-05-07T20:32:05.7977323Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.7977529Z op = silu_mul_quant 2025-05-07T20:32:05.7977619Z if compiled: 2025-05-07T20:32:05.7977716Z op = torch.compile(op) 2025-05-07T20:32:05.7977821Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.7977894Z 2025-05-07T20:32:05.7977984Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.7977989Z 2025-05-07T20:32:05.7978085Z moe/activation_test.py:117: 2025-05-07T20:32:05.7978215Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7978322Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.7978424Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.7978786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.7978876Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.7979368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.7979512Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.7979863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.7980085Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.7980420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.7980516Z kernel = self.compile( 2025-05-07T20:32:05.7980897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.7981069Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.7981202Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7981207Z 2025-05-07T20:32:05.7981409Z self = 2025-05-07T20:32:05.7982188Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.7982694Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9c09c60>} 2025-05-07T20:32:05.7983510Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.7983706Z context = 2025-05-07T20:32:05.7983711Z 2025-05-07T20:32:05.7983874Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.7984139Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.7984250Z module_map=module_map) 2025-05-07T20:32:05.7984408Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.7984508Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.7984586Z E ^ 2025-05-07T20:32:05.7984939Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.7984944Z 2025-05-07T20:32:05.7985360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.7985365Z 2025-05-07T20:32:05.7985466Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.7985689Z self=, 2025-05-07T20:32:05.7985768Z T=128, 2025-05-07T20:32:05.7985847Z D=7168, 2025-05-07T20:32:05.7985930Z scale_ub=1200.0, 2025-05-07T20:32:05.7986011Z contiguous=False, 2025-05-07T20:32:05.7986138Z compiled=False, 2025-05-07T20:32:05.7986209Z ) 2025-05-07T20:32:05.7986420Z self = 2025-05-07T20:32:05.7986593Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:05.7986598Z 2025-05-07T20:32:05.7986677Z @given( 2025-05-07T20:32:05.7986806Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.7986904Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.7987023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.7987143Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.7987254Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.7987329Z ) 2025-05-07T20:32:05.7987577Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.7987668Z def test_silu_mul_quant( 2025-05-07T20:32:05.7987745Z self, 2025-05-07T20:32:05.7987831Z T: int, 2025-05-07T20:32:05.7987947Z D: int, 2025-05-07T20:32:05.7988052Z scale_ub: Optional[float], 2025-05-07T20:32:05.7988142Z contiguous: bool, 2025-05-07T20:32:05.7988227Z compiled: bool, 2025-05-07T20:32:05.7988313Z ) -> None: 2025-05-07T20:32:05.7988406Z torch.manual_seed(2025) 2025-05-07T20:32:05.7988477Z 2025-05-07T20:32:05.7988649Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.7988725Z 2025-05-07T20:32:05.7988819Z x_sign = torch.sign(x) 2025-05-07T20:32:05.7988947Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.7989035Z x = x_sign * x_clamp 2025-05-07T20:32:05.7989116Z x0 = x[:, :D] 2025-05-07T20:32:05.7989202Z x1 = x[:, D:] 2025-05-07T20:32:05.7989275Z 2025-05-07T20:32:05.7989359Z if contiguous: 2025-05-07T20:32:05.7989455Z x0 = x0.contiguous() 2025-05-07T20:32:05.7989544Z x1 = x1.contiguous() 2025-05-07T20:32:05.7989629Z 2025-05-07T20:32:05.7989719Z if scale_ub is not None: 2025-05-07T20:32:05.7989825Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.7989963Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.7990040Z ) 2025-05-07T20:32:05.7990116Z else: 2025-05-07T20:32:05.7990216Z scale_ub_tensor = None 2025-05-07T20:32:05.7990289Z 2025-05-07T20:32:05.7990417Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.7990617Z op = silu_mul_quant 2025-05-07T20:32:05.7990703Z if compiled: 2025-05-07T20:32:05.7990803Z op = torch.compile(op) 2025-05-07T20:32:05.7990914Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.7990987Z 2025-05-07T20:32:05.7991082Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.7991086Z 2025-05-07T20:32:05.7991182Z moe/activation_test.py:117: 2025-05-07T20:32:05.7991310Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7991420Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.7991518Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.7992012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.7992113Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.7992473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.7992698Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.7993035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.7993130Z kernel = self.compile( 2025-05-07T20:32:05.7993513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.7993732Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.7993858Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7993869Z 2025-05-07T20:32:05.7994071Z self = 2025-05-07T20:32:05.7994847Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.7995357Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9c09800>} 2025-05-07T20:32:05.7996099Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.7996339Z context = 2025-05-07T20:32:05.7996344Z 2025-05-07T20:32:05.7996505Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.7996763Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.7996874Z module_map=module_map) 2025-05-07T20:32:05.7997035Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.7997140Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.7997215Z E ^ 2025-05-07T20:32:05.7997568Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.7997572Z 2025-05-07T20:32:05.7997989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.7997994Z 2025-05-07T20:32:05.7998101Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.7998320Z self=, 2025-05-07T20:32:05.7998405Z T=128, 2025-05-07T20:32:05.7998484Z D=5120, 2025-05-07T20:32:05.7998570Z scale_ub=None, 2025-05-07T20:32:05.7998654Z contiguous=False, 2025-05-07T20:32:05.7998737Z compiled=False, 2025-05-07T20:32:05.7998818Z ) 2025-05-07T20:32:05.7999035Z self = 2025-05-07T20:32:05.7999282Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:05.7999287Z 2025-05-07T20:32:05.7999368Z @given( 2025-05-07T20:32:05.7999487Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.7999583Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.7999703Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.7999820Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.7999945Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8000020Z ) 2025-05-07T20:32:05.8000261Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8000358Z def test_silu_mul_quant( 2025-05-07T20:32:05.8000436Z self, 2025-05-07T20:32:05.8000513Z T: int, 2025-05-07T20:32:05.8000596Z D: int, 2025-05-07T20:32:05.8000693Z scale_ub: Optional[float], 2025-05-07T20:32:05.8000781Z contiguous: bool, 2025-05-07T20:32:05.8000882Z compiled: bool, 2025-05-07T20:32:05.8000965Z ) -> None: 2025-05-07T20:32:05.8001061Z torch.manual_seed(2025) 2025-05-07T20:32:05.8001138Z 2025-05-07T20:32:05.8001302Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8001383Z 2025-05-07T20:32:05.8001475Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8001601Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8001738Z x = x_sign * x_clamp 2025-05-07T20:32:05.8001820Z x0 = x[:, :D] 2025-05-07T20:32:05.8001898Z x1 = x[:, D:] 2025-05-07T20:32:05.8001981Z 2025-05-07T20:32:05.8002063Z if contiguous: 2025-05-07T20:32:05.8002152Z x0 = x0.contiguous() 2025-05-07T20:32:05.8002252Z x1 = x1.contiguous() 2025-05-07T20:32:05.8002325Z 2025-05-07T20:32:05.8002416Z if scale_ub is not None: 2025-05-07T20:32:05.8002529Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8002666Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8002745Z ) 2025-05-07T20:32:05.8002826Z else: 2025-05-07T20:32:05.8002921Z scale_ub_tensor = None 2025-05-07T20:32:05.8003000Z 2025-05-07T20:32:05.8003128Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8003217Z op = silu_mul_quant 2025-05-07T20:32:05.8003306Z if compiled: 2025-05-07T20:32:05.8003406Z op = torch.compile(op) 2025-05-07T20:32:05.8003558Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8003636Z 2025-05-07T20:32:05.8003726Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8003730Z 2025-05-07T20:32:05.8003825Z moe/activation_test.py:117: 2025-05-07T20:32:05.8003959Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8004057Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8004162Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8004658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8004755Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8005115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8005332Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8005913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8006048Z kernel = self.compile( 2025-05-07T20:32:05.8006479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8006659Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8006784Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8006930Z 2025-05-07T20:32:05.8007153Z self = 2025-05-07T20:32:05.8008066Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8008571Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9c0bce0>} 2025-05-07T20:32:05.8009327Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8009515Z context = 2025-05-07T20:32:05.8009519Z 2025-05-07T20:32:05.8009694Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8009958Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8010066Z module_map=module_map) 2025-05-07T20:32:05.8010232Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8010333Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8010413Z E ^ 2025-05-07T20:32:05.8010768Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8010840Z 2025-05-07T20:32:05.8011251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8011256Z 2025-05-07T20:32:05.8011364Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8011583Z self=, 2025-05-07T20:32:05.8011662Z T=128, 2025-05-07T20:32:05.8011752Z D=5120, 2025-05-07T20:32:05.8011836Z scale_ub=1200.0, 2025-05-07T20:32:05.8011921Z contiguous=True, 2025-05-07T20:32:05.8012009Z compiled=False, 2025-05-07T20:32:05.8012082Z ) 2025-05-07T20:32:05.8012298Z self = 2025-05-07T20:32:05.8012475Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.8012480Z 2025-05-07T20:32:05.8012560Z @given( 2025-05-07T20:32:05.8012783Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8012883Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8012996Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8013119Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8013231Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8013307Z ) 2025-05-07T20:32:05.8013559Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8013654Z def test_silu_mul_quant( 2025-05-07T20:32:05.8013734Z self, 2025-05-07T20:32:05.8013811Z T: int, 2025-05-07T20:32:05.8013888Z D: int, 2025-05-07T20:32:05.8013993Z scale_ub: Optional[float], 2025-05-07T20:32:05.8014083Z contiguous: bool, 2025-05-07T20:32:05.8014172Z compiled: bool, 2025-05-07T20:32:05.8014256Z ) -> None: 2025-05-07T20:32:05.8014351Z torch.manual_seed(2025) 2025-05-07T20:32:05.8014431Z 2025-05-07T20:32:05.8014602Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8014677Z 2025-05-07T20:32:05.8014767Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8014896Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8014985Z x = x_sign * x_clamp 2025-05-07T20:32:05.8015075Z x0 = x[:, :D] 2025-05-07T20:32:05.8015156Z x1 = x[:, D:] 2025-05-07T20:32:05.8015231Z 2025-05-07T20:32:05.8015402Z if contiguous: 2025-05-07T20:32:05.8015496Z x0 = x0.contiguous() 2025-05-07T20:32:05.8015586Z x1 = x1.contiguous() 2025-05-07T20:32:05.8015667Z 2025-05-07T20:32:05.8015756Z if scale_ub is not None: 2025-05-07T20:32:05.8015861Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8016002Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8016076Z ) 2025-05-07T20:32:05.8016153Z else: 2025-05-07T20:32:05.8016257Z scale_ub_tensor = None 2025-05-07T20:32:05.8016331Z 2025-05-07T20:32:05.8016459Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8016557Z op = silu_mul_quant 2025-05-07T20:32:05.8016643Z if compiled: 2025-05-07T20:32:05.8016748Z op = torch.compile(op) 2025-05-07T20:32:05.8016853Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8016926Z 2025-05-07T20:32:05.8017026Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8017036Z 2025-05-07T20:32:05.8017133Z moe/activation_test.py:117: 2025-05-07T20:32:05.8017263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8017368Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8017467Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8017964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8018116Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8018472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8018699Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8019040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8019135Z kernel = self.compile( 2025-05-07T20:32:05.8019523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8019696Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8019830Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8019835Z 2025-05-07T20:32:05.8020039Z self = 2025-05-07T20:32:05.8020879Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8021384Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9fc3920>} 2025-05-07T20:32:05.8022133Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8022326Z context = 2025-05-07T20:32:05.8022331Z 2025-05-07T20:32:05.8022495Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8022756Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8022873Z module_map=module_map) 2025-05-07T20:32:05.8023036Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8023138Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8023216Z E ^ 2025-05-07T20:32:05.8023567Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8023571Z 2025-05-07T20:32:05.8024063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8024069Z 2025-05-07T20:32:05.8024173Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8024399Z self=, 2025-05-07T20:32:05.8024478Z T=1, 2025-05-07T20:32:05.8024557Z D=7168, 2025-05-07T20:32:05.8024646Z scale_ub=1200.0, 2025-05-07T20:32:05.8024730Z contiguous=True, 2025-05-07T20:32:05.8024819Z compiled=True, 2025-05-07T20:32:05.8024897Z ) 2025-05-07T20:32:05.8025112Z self = 2025-05-07T20:32:05.8025276Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.8025280Z 2025-05-07T20:32:05.8025365Z @given( 2025-05-07T20:32:05.8025485Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8025592Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8025712Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8025829Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8025948Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8026025Z ) 2025-05-07T20:32:05.8026268Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8026370Z def test_silu_mul_quant( 2025-05-07T20:32:05.8026443Z self, 2025-05-07T20:32:05.8026563Z T: int, 2025-05-07T20:32:05.8026648Z D: int, 2025-05-07T20:32:05.8026744Z scale_ub: Optional[float], 2025-05-07T20:32:05.8026833Z contiguous: bool, 2025-05-07T20:32:05.8026925Z compiled: bool, 2025-05-07T20:32:05.8027003Z ) -> None: 2025-05-07T20:32:05.8027103Z torch.manual_seed(2025) 2025-05-07T20:32:05.8027178Z 2025-05-07T20:32:05.8027346Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8027429Z 2025-05-07T20:32:05.8027526Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8027652Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8027748Z x = x_sign * x_clamp 2025-05-07T20:32:05.8027827Z x0 = x[:, :D] 2025-05-07T20:32:05.8027908Z x1 = x[:, D:] 2025-05-07T20:32:05.8027991Z 2025-05-07T20:32:05.8028074Z if contiguous: 2025-05-07T20:32:05.8028163Z x0 = x0.contiguous() 2025-05-07T20:32:05.8028260Z x1 = x1.contiguous() 2025-05-07T20:32:05.8028382Z 2025-05-07T20:32:05.8028472Z if scale_ub is not None: 2025-05-07T20:32:05.8028585Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8028720Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8028806Z ) 2025-05-07T20:32:05.8028879Z else: 2025-05-07T20:32:05.8028973Z scale_ub_tensor = None 2025-05-07T20:32:05.8029050Z 2025-05-07T20:32:05.8029179Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8029276Z op = silu_mul_quant 2025-05-07T20:32:05.8029369Z if compiled: 2025-05-07T20:32:05.8029470Z op = torch.compile(op) 2025-05-07T20:32:05.8029575Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8029652Z 2025-05-07T20:32:05.8029745Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8029749Z 2025-05-07T20:32:05.8029851Z moe/activation_test.py:117: 2025-05-07T20:32:05.8029979Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8030084Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8030189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8030554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.8030647Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.8031226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8031329Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8031692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8031913Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8032250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8032357Z kernel = self.compile( 2025-05-07T20:32:05.8032736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8032910Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8033043Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8033048Z 2025-05-07T20:32:05.8033252Z self = 2025-05-07T20:32:05.8034037Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8034537Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a84950220>} 2025-05-07T20:32:05.8035332Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8035521Z context = 2025-05-07T20:32:05.8035525Z 2025-05-07T20:32:05.8035690Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8035963Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8036070Z module_map=module_map) 2025-05-07T20:32:05.8036238Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8036337Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8036415Z E ^ 2025-05-07T20:32:05.8036771Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:05.8037233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:05.8037338Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[identical test body and traceback omitted]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
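Note: every example in this section fails for the same reason. Triton only lowers the fp8e4nv type (float8 e4m3) on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper), and the linux.g5.4xlarge runner's A10G is SM 8.6, hence "The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')". A minimal sketch of a guard such a test could use; the helper name and skip message are illustrative, not part of the test file:

import unittest

import torch

def gpu_supports_fp8e4nv() -> bool:
    # Triton emits fp8e4nv (e4m3) only for SM 8.9+ (e.g. L4, L40S, H100).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Applied to the test class or method, this skips cleanly instead of erroring:
skip_unless_fp8e4nv = unittest.skipUnless(
    gpu_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+; A10G (g5) is SM 8.6"
)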
2025-05-07T20:32:05.8054674Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
[identical test body omitted] In this example fn() itself succeeds; the reference path fails instead:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
[triton autotuner and compiler frames omitted]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:05.8071152Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[identical test body and traceback omitted]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
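Note: the failure is independent of torch.compile (both compiled=True and compiled=False examples hit it) because the ValueError is raised while Triton lowers the kernel AST, before anything executes. A minimal self-contained sketch that should reproduce the same CompilationError on a pre-SM-8.9 GPU; the kernel name and shapes are illustrative:

import torch
import triton
import triton.language as tl

@triton.jit
def _cast_fp8e4nv_kernel(X, Y, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    x = tl.load(X + offs)
    # On SM < 8.9 this cast fails at compile time with:
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    tl.store(Y + offs, x.to(tl.float8e4nv))

x = torch.randn(16, device="cuda", dtype=torch.float32)
y = torch.empty(16, device="cuda", dtype=torch.float8_e4m3fn)
_cast_fp8e4nv_kernel[(1,)](x, y, BLOCK=16)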
2025-05-07T20:32:05.8085089Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
[identical test body and traceback omitted; same CompilationError in _fbgemm_silu_mul_quant]

2025-05-07T20:32:05.8097877Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[identical test body and traceback omitted; same CompilationError in _fbgemm_silu_mul_quant]

2025-05-07T20:32:05.8111472Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[identical test body and traceback omitted; same CompilationError in _fbgemm_silu_mul_quant]

2025-05-07T20:32:05.8124856Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
[identical test body and traceback omitted]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
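Note: the test's reference path fails as well, since triton_quantize_fp8_row also JIT-compiles a Triton kernel (_kernel_quantize_fp8_row). On hardware without fp8e4nv support, a rough eager-mode stand-in for rowwise fp8 quantization could look like the sketch below; it assumes the common max-abs / 448 e4m3 recipe and is not FBGEMM's actual implementation:

import torch

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value

def quantize_fp8_row_ref(y, scale_ub=None):
    # Rowwise scale so each row's max |value| maps to the e4m3 maximum;
    # dequantization is y_fp8.float() * scale[:, None], as in the test.
    row_max = y.abs().amax(dim=-1, keepdim=True).float()
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = torch.clamp(row_max, min=1e-12) / FP8_E4M3_MAX
    y_fp8 = (y.float() / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(-1)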
2025-05-07T20:32:05.8137860Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[identical test body and traceback omitted; same CompilationError in _fbgemm_silu_mul_quant]

2025-05-07T20:32:05.8150715Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
[identical test body and traceback omitted; same CompilationError in _fbgemm_silu_mul_quant]

2025-05-07T20:32:05.8164144Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
[identical test body and traceback omitted; same CompilationError in _fbgemm_silu_mul_quant]

2025-05-07T20:32:05.8181136Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
[identical test body and traceback omitted]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8193243Z 2025-05-07T20:32:05.8193653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8193658Z 2025-05-07T20:32:05.8193759Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8193984Z self=, 2025-05-07T20:32:05.8194066Z T=4096, 2025-05-07T20:32:05.8194141Z D=5120, 2025-05-07T20:32:05.8194229Z scale_ub=1200.0, 2025-05-07T20:32:05.8194318Z contiguous=False, 2025-05-07T20:32:05.8194403Z compiled=True, 2025-05-07T20:32:05.8194476Z ) 2025-05-07T20:32:05.8194693Z self = 2025-05-07T20:32:05.8194866Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:05.8194871Z 2025-05-07T20:32:05.8195024Z @given( 2025-05-07T20:32:05.8195144Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8195246Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8195359Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8195475Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8195590Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8195664Z ) 2025-05-07T20:32:05.8195906Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8196003Z def test_silu_mul_quant( 2025-05-07T20:32:05.8196078Z self, 2025-05-07T20:32:05.8196156Z T: int, 2025-05-07T20:32:05.8196231Z D: int, 2025-05-07T20:32:05.8196329Z scale_ub: Optional[float], 2025-05-07T20:32:05.8196419Z contiguous: bool, 2025-05-07T20:32:05.8196504Z compiled: bool, 2025-05-07T20:32:05.8196581Z ) -> None: 2025-05-07T20:32:05.8196678Z torch.manual_seed(2025) 2025-05-07T20:32:05.8196758Z 2025-05-07T20:32:05.8196952Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8197048Z 2025-05-07T20:32:05.8197141Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8197268Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8197355Z x = x_sign * x_clamp 2025-05-07T20:32:05.8197435Z x0 = x[:, :D] 2025-05-07T20:32:05.8197517Z x1 = x[:, D:] 2025-05-07T20:32:05.8197637Z 2025-05-07T20:32:05.8197720Z if contiguous: 2025-05-07T20:32:05.8197815Z x0 = x0.contiguous() 2025-05-07T20:32:05.8197903Z x1 = x1.contiguous() 2025-05-07T20:32:05.8197974Z 2025-05-07T20:32:05.8198066Z if scale_ub is not None: 2025-05-07T20:32:05.8198174Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8198307Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8198385Z ) 2025-05-07T20:32:05.8198460Z else: 2025-05-07T20:32:05.8198559Z scale_ub_tensor = None 2025-05-07T20:32:05.8198634Z 2025-05-07T20:32:05.8198762Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8198857Z op = silu_mul_quant 2025-05-07T20:32:05.8198941Z if compiled: 2025-05-07T20:32:05.8199040Z op = torch.compile(op) 2025-05-07T20:32:05.8199148Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8199223Z 2025-05-07T20:32:05.8199359Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8199363Z 2025-05-07T20:32:05.8199464Z moe/activation_test.py:117: 2025-05-07T20:32:05.8199593Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8199692Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8199792Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8200157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.8200256Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.8200744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8200840Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8201198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8201416Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8201758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8201851Z kernel = self.compile( 2025-05-07T20:32:05.8202230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8202410Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8202611Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8202615Z 2025-05-07T20:32:05.8202817Z self = 2025-05-07T20:32:05.8203590Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8204093Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a92d9620>} 2025-05-07T20:32:05.8204838Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8205025Z context = 2025-05-07T20:32:05.8205034Z 2025-05-07T20:32:05.8205200Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8205459Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8205564Z module_map=module_map) 2025-05-07T20:32:05.8206021Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8206159Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8206372Z E ^ 2025-05-07T20:32:05.8206759Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8206765Z 2025-05-07T20:32:05.8207175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8207179Z 2025-05-07T20:32:05.8207280Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8207504Z self=, 2025-05-07T20:32:05.8207642Z T=2048, 2025-05-07T20:32:05.8207720Z D=7168, 2025-05-07T20:32:05.8207801Z scale_ub=1200.0, 2025-05-07T20:32:05.8207889Z contiguous=False, 2025-05-07T20:32:05.8207972Z compiled=False, 2025-05-07T20:32:05.8208043Z ) 2025-05-07T20:32:05.8208259Z self = 2025-05-07T20:32:05.8208429Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:05.8208510Z 2025-05-07T20:32:05.8208587Z @given( 2025-05-07T20:32:05.8208703Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8208797Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8208911Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8209025Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8209134Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8209207Z ) 2025-05-07T20:32:05.8209450Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8209544Z def test_silu_mul_quant( 2025-05-07T20:32:05.8209619Z self, 2025-05-07T20:32:05.8209692Z T: int, 2025-05-07T20:32:05.8209769Z D: int, 2025-05-07T20:32:05.8209863Z scale_ub: Optional[float], 2025-05-07T20:32:05.8209949Z contiguous: bool, 2025-05-07T20:32:05.8210034Z compiled: bool, 2025-05-07T20:32:05.8210108Z ) -> None: 2025-05-07T20:32:05.8210208Z torch.manual_seed(2025) 2025-05-07T20:32:05.8210285Z 2025-05-07T20:32:05.8210449Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8210522Z 2025-05-07T20:32:05.8210614Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8210735Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8210820Z x = x_sign * x_clamp 2025-05-07T20:32:05.8210899Z x0 = x[:, :D] 2025-05-07T20:32:05.8210977Z x1 = x[:, D:] 2025-05-07T20:32:05.8211204Z 2025-05-07T20:32:05.8211285Z if contiguous: 2025-05-07T20:32:05.8211372Z x0 = x0.contiguous() 2025-05-07T20:32:05.8211459Z x1 = x1.contiguous() 2025-05-07T20:32:05.8211529Z 2025-05-07T20:32:05.8211617Z if scale_ub is not None: 2025-05-07T20:32:05.8211722Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8211852Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8211932Z ) 2025-05-07T20:32:05.8212006Z else: 2025-05-07T20:32:05.8212097Z scale_ub_tensor = None 2025-05-07T20:32:05.8212168Z 2025-05-07T20:32:05.8212297Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8212383Z op = silu_mul_quant 2025-05-07T20:32:05.8212467Z if compiled: 2025-05-07T20:32:05.8212564Z op = torch.compile(op) 2025-05-07T20:32:05.8212664Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8212737Z 2025-05-07T20:32:05.8212832Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8212836Z 2025-05-07T20:32:05.8212930Z moe/activation_test.py:117: 2025-05-07T20:32:05.8213060Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8213157Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8213253Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8213750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:05.8213893Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8214249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8214466Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8214801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8214901Z kernel = self.compile( 2025-05-07T20:32:05.8215276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8215445Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8215573Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8215577Z 2025-05-07T20:32:05.8215776Z self = 2025-05-07T20:32:05.8216588Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8217082Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a92da480>} 2025-05-07T20:32:05.8217829Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8218015Z context = 2025-05-07T20:32:05.8218020Z 2025-05-07T20:32:05.8218179Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8218448Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8218553Z module_map=module_map) 2025-05-07T20:32:05.8218716Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8218813Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8218888Z E ^ 2025-05-07T20:32:05.8219242Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8219325Z 2025-05-07T20:32:05.8219732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8219737Z 2025-05-07T20:32:05.8219841Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8220058Z self=, 2025-05-07T20:32:05.8220136Z T=1, 2025-05-07T20:32:05.8220217Z D=7168, 2025-05-07T20:32:05.8220299Z scale_ub=None, 2025-05-07T20:32:05.8220383Z contiguous=True, 2025-05-07T20:32:05.8220469Z compiled=False, 2025-05-07T20:32:05.8220539Z ) 2025-05-07T20:32:05.8220753Z self = 2025-05-07T20:32:05.8220919Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:05.8220924Z 2025-05-07T20:32:05.8221000Z @given( 2025-05-07T20:32:05.8221115Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8221223Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8221337Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8221454Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8221566Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8221640Z ) 2025-05-07T20:32:05.8221881Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8221973Z def test_silu_mul_quant( 2025-05-07T20:32:05.8222089Z self, 2025-05-07T20:32:05.8222166Z T: int, 2025-05-07T20:32:05.8222239Z D: int, 2025-05-07T20:32:05.8222331Z scale_ub: Optional[float], 2025-05-07T20:32:05.8222420Z contiguous: bool, 2025-05-07T20:32:05.8222502Z compiled: bool, 2025-05-07T20:32:05.8222580Z ) -> None: 2025-05-07T20:32:05.8222673Z torch.manual_seed(2025) 2025-05-07T20:32:05.8222743Z 2025-05-07T20:32:05.8222911Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8222986Z 2025-05-07T20:32:05.8223073Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8223194Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8223280Z x = x_sign * x_clamp 2025-05-07T20:32:05.8223359Z x0 = x[:, :D] 2025-05-07T20:32:05.8223441Z x1 = x[:, D:] 2025-05-07T20:32:05.8223509Z 2025-05-07T20:32:05.8223590Z if contiguous: 2025-05-07T20:32:05.8223683Z x0 = x0.contiguous() 2025-05-07T20:32:05.8223819Z x1 = x1.contiguous() 2025-05-07T20:32:05.8223888Z 2025-05-07T20:32:05.8223978Z if scale_ub is not None: 2025-05-07T20:32:05.8224082Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8224219Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8224290Z ) 2025-05-07T20:32:05.8224363Z else: 2025-05-07T20:32:05.8224456Z scale_ub_tensor = None 2025-05-07T20:32:05.8224527Z 2025-05-07T20:32:05.8224656Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8224748Z op = silu_mul_quant 2025-05-07T20:32:05.8224829Z if compiled: 2025-05-07T20:32:05.8224924Z op = torch.compile(op) 2025-05-07T20:32:05.8225028Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8225097Z 2025-05-07T20:32:05.8225186Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8225193Z 2025-05-07T20:32:05.8225285Z moe/activation_test.py:117: 2025-05-07T20:32:05.8225417Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8225516Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8225611Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8226105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8226202Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8226639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8226885Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8227249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8227342Z kernel = self.compile( 2025-05-07T20:32:05.8227718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8227895Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8228017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8228022Z 2025-05-07T20:32:05.8228230Z self = 2025-05-07T20:32:05.8228999Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8229497Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a92d9da0>} 2025-05-07T20:32:05.8230232Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8230465Z context = 2025-05-07T20:32:05.8230469Z 2025-05-07T20:32:05.8230629Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8230887Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8230994Z module_map=module_map) 2025-05-07T20:32:05.8231156Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8231253Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8231335Z E ^ 2025-05-07T20:32:05.8231682Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8231687Z 2025-05-07T20:32:05.8232100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8232147Z 2025-05-07T20:32:05.8232250Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8232470Z self=, 2025-05-07T20:32:05.8232551Z T=16384, 2025-05-07T20:32:05.8232628Z D=7168, 2025-05-07T20:32:05.8232711Z scale_ub=1200.0, 2025-05-07T20:32:05.8232796Z contiguous=False, 2025-05-07T20:32:05.8232878Z compiled=True, 2025-05-07T20:32:05.8232953Z ) 2025-05-07T20:32:05.8233171Z self = 2025-05-07T20:32:05.8233345Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:05.8233350Z 2025-05-07T20:32:05.8233428Z @given( 2025-05-07T20:32:05.8233541Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8233635Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8233748Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8233866Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8233978Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8234053Z ) 2025-05-07T20:32:05.8234297Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8234393Z def test_silu_mul_quant( 2025-05-07T20:32:05.8234469Z self, 2025-05-07T20:32:05.8234545Z T: int, 2025-05-07T20:32:05.8234623Z D: int, 2025-05-07T20:32:05.8234718Z scale_ub: Optional[float], 2025-05-07T20:32:05.8234891Z contiguous: bool, 2025-05-07T20:32:05.8234976Z compiled: bool, 2025-05-07T20:32:05.8235052Z ) -> None: 2025-05-07T20:32:05.8235151Z torch.manual_seed(2025) 2025-05-07T20:32:05.8235222Z 2025-05-07T20:32:05.8235390Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8235469Z 2025-05-07T20:32:05.8235557Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8235678Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8235774Z x = x_sign * x_clamp 2025-05-07T20:32:05.8235852Z x0 = x[:, :D] 2025-05-07T20:32:05.8235930Z x1 = x[:, D:] 2025-05-07T20:32:05.8236009Z 2025-05-07T20:32:05.8236090Z if contiguous: 2025-05-07T20:32:05.8236185Z x0 = x0.contiguous() 2025-05-07T20:32:05.8236273Z x1 = x1.contiguous() 2025-05-07T20:32:05.8236347Z 2025-05-07T20:32:05.8236442Z if scale_ub is not None: 2025-05-07T20:32:05.8236552Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8236684Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8236761Z ) 2025-05-07T20:32:05.8236838Z else: 2025-05-07T20:32:05.8236933Z scale_ub_tensor = None 2025-05-07T20:32:05.8237007Z 2025-05-07T20:32:05.8237133Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8237224Z op = silu_mul_quant 2025-05-07T20:32:05.8237360Z if compiled: 2025-05-07T20:32:05.8237460Z op = torch.compile(op) 2025-05-07T20:32:05.8237568Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8237641Z 2025-05-07T20:32:05.8237732Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8237736Z 2025-05-07T20:32:05.8237837Z moe/activation_test.py:117: 2025-05-07T20:32:05.8237967Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8238065Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8238173Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8238538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.8238629Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.8239123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8239219Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8239620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8239842Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8240181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8240281Z kernel = self.compile( 2025-05-07T20:32:05.8240661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8240835Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8240960Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8240965Z 2025-05-07T20:32:05.8241167Z self = 2025-05-07T20:32:05.8241939Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8242442Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9880a40>} 2025-05-07T20:32:05.8243351Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8243541Z context = 2025-05-07T20:32:05.8243546Z 2025-05-07T20:32:05.8243706Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8243969Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8244079Z module_map=module_map) 2025-05-07T20:32:05.8244247Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8244346Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8244425Z E ^ 2025-05-07T20:32:05.8244779Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8244784Z 2025-05-07T20:32:05.8245198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8245202Z 2025-05-07T20:32:05.8245310Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8245532Z self=, 2025-05-07T20:32:05.8245611Z T=1, 2025-05-07T20:32:05.8245695Z D=7168, 2025-05-07T20:32:05.8245779Z scale_ub=None, 2025-05-07T20:32:05.8245863Z contiguous=False, 2025-05-07T20:32:05.8245951Z compiled=False, 2025-05-07T20:32:05.8246066Z ) 2025-05-07T20:32:05.8246281Z self = 2025-05-07T20:32:05.8246451Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:05.8246456Z 2025-05-07T20:32:05.8246532Z @given( 2025-05-07T20:32:05.8246657Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8246757Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8246871Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8246993Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8247104Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8247183Z ) 2025-05-07T20:32:05.8247428Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8247572Z def test_silu_mul_quant( 2025-05-07T20:32:05.8247651Z self, 2025-05-07T20:32:05.8247733Z T: int, 2025-05-07T20:32:05.8247812Z D: int, 2025-05-07T20:32:05.8247957Z scale_ub: Optional[float], 2025-05-07T20:32:05.8248050Z contiguous: bool, 2025-05-07T20:32:05.8248134Z compiled: bool, 2025-05-07T20:32:05.8248216Z ) -> None: 2025-05-07T20:32:05.8248310Z torch.manual_seed(2025) 2025-05-07T20:32:05.8248384Z 2025-05-07T20:32:05.8248560Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8248634Z 2025-05-07T20:32:05.8248725Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8248861Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8248950Z x = x_sign * x_clamp 2025-05-07T20:32:05.8249026Z x0 = x[:, :D] 2025-05-07T20:32:05.8249111Z x1 = x[:, D:] 2025-05-07T20:32:05.8249185Z 2025-05-07T20:32:05.8249268Z if contiguous: 2025-05-07T20:32:05.8249363Z x0 = x0.contiguous() 2025-05-07T20:32:05.8249451Z x1 = x1.contiguous() 2025-05-07T20:32:05.8249521Z 2025-05-07T20:32:05.8249617Z if scale_ub is not None: 2025-05-07T20:32:05.8249725Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8249864Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8249940Z ) 2025-05-07T20:32:05.8250015Z else: 2025-05-07T20:32:05.8250115Z scale_ub_tensor = None 2025-05-07T20:32:05.8250188Z 2025-05-07T20:32:05.8250315Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8250411Z op = silu_mul_quant 2025-05-07T20:32:05.8250575Z if compiled: 2025-05-07T20:32:05.8250675Z op = torch.compile(op) 2025-05-07T20:32:05.8250785Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8250858Z 2025-05-07T20:32:05.8250948Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8250959Z 2025-05-07T20:32:05.8251057Z moe/activation_test.py:117: 2025-05-07T20:32:05.8251187Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8251298Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8251398Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8251892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8251993Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8252344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8252574Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8252912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8253006Z kernel = self.compile( 2025-05-07T20:32:05.8253386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8253558Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8253730Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8253734Z 2025-05-07T20:32:05.8253938Z self = 2025-05-07T20:32:05.8254706Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8255215Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a98818a0>} 2025-05-07T20:32:05.8255953Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8256144Z context = 2025-05-07T20:32:05.8256190Z 2025-05-07T20:32:05.8256352Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8256614Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8256730Z module_map=module_map) 2025-05-07T20:32:05.8256915Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8257032Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8257119Z E ^ 2025-05-07T20:32:05.8257471Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8257476Z 2025-05-07T20:32:05.8257888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8257893Z 2025-05-07T20:32:05.8257997Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8258222Z self=, 2025-05-07T20:32:05.8258306Z T=2048, 2025-05-07T20:32:05.8258385Z D=7168, 2025-05-07T20:32:05.8258469Z scale_ub=None, 2025-05-07T20:32:05.8258561Z contiguous=False, 2025-05-07T20:32:05.8258644Z compiled=True, 2025-05-07T20:32:05.8258724Z ) 2025-05-07T20:32:05.8258938Z self = 2025-05-07T20:32:05.8259185Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:05.8259191Z 2025-05-07T20:32:05.8259273Z @given( 2025-05-07T20:32:05.8259389Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8259487Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8259606Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8259721Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8259837Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8259914Z ) 2025-05-07T20:32:05.8260154Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8260253Z def test_silu_mul_quant( 2025-05-07T20:32:05.8260329Z self, 2025-05-07T20:32:05.8260405Z T: int, 2025-05-07T20:32:05.8260488Z D: int, 2025-05-07T20:32:05.8260586Z scale_ub: Optional[float], 2025-05-07T20:32:05.8260675Z contiguous: bool, 2025-05-07T20:32:05.8260767Z compiled: bool, 2025-05-07T20:32:05.8260852Z ) -> None: 2025-05-07T20:32:05.8260946Z torch.manual_seed(2025) 2025-05-07T20:32:05.8261024Z 2025-05-07T20:32:05.8261190Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8261264Z 2025-05-07T20:32:05.8261358Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8261479Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8261573Z x = x_sign * x_clamp 2025-05-07T20:32:05.8261651Z x0 = x[:, :D] 2025-05-07T20:32:05.8261780Z x1 = x[:, D:] 2025-05-07T20:32:05.8261862Z 2025-05-07T20:32:05.8261943Z if contiguous: 2025-05-07T20:32:05.8262032Z x0 = x0.contiguous() 2025-05-07T20:32:05.8262128Z x1 = x1.contiguous() 2025-05-07T20:32:05.8262196Z 2025-05-07T20:32:05.8262283Z if scale_ub is not None: 2025-05-07T20:32:05.8262392Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8262526Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8262606Z ) 2025-05-07T20:32:05.8262687Z else: 2025-05-07T20:32:05.8262779Z scale_ub_tensor = None 2025-05-07T20:32:05.8262856Z 2025-05-07T20:32:05.8262983Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8263072Z op = silu_mul_quant 2025-05-07T20:32:05.8263159Z if compiled: 2025-05-07T20:32:05.8263258Z op = torch.compile(op) 2025-05-07T20:32:05.8263359Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8263483Z 2025-05-07T20:32:05.8263572Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8263577Z 2025-05-07T20:32:05.8263672Z moe/activation_test.py:117: 2025-05-07T20:32:05.8263806Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8263906Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8264009Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8264381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.8264472Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.8264963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8265062Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8265412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8265643Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8265981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8266081Z kernel = self.compile( 2025-05-07T20:32:05.8266458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8266709Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8266841Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8266846Z 2025-05-07T20:32:05.8267046Z self = 2025-05-07T20:32:05.8267819Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8268322Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9882b60>} 2025-05-07T20:32:05.8269059Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8269259Z context = 2025-05-07T20:32:05.8269264Z 2025-05-07T20:32:05.8269425Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8269694Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8269798Z module_map=module_map) 2025-05-07T20:32:05.8269956Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8270106Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8270183Z E ^ 2025-05-07T20:32:05.8270532Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8270544Z 2025-05-07T20:32:05.8270952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8270957Z 2025-05-07T20:32:05.8271061Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8271288Z self=, 2025-05-07T20:32:05.8271368Z T=4096, 2025-05-07T20:32:05.8271446Z D=7168, 2025-05-07T20:32:05.8271533Z scale_ub=None, 2025-05-07T20:32:05.8271616Z contiguous=False, 2025-05-07T20:32:05.8271701Z compiled=True, 2025-05-07T20:32:05.8271776Z ) 2025-05-07T20:32:05.8271990Z self = 2025-05-07T20:32:05.8272233Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:05.8272238Z 2025-05-07T20:32:05.8272314Z @given( 2025-05-07T20:32:05.8272434Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8272538Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8272652Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8272771Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8272886Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8272966Z ) 2025-05-07T20:32:05.8273213Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8273308Z def test_silu_mul_quant( 2025-05-07T20:32:05.8273382Z self, 2025-05-07T20:32:05.8273464Z T: int, 2025-05-07T20:32:05.8273541Z D: int, 2025-05-07T20:32:05.8273639Z scale_ub: Optional[float], 2025-05-07T20:32:05.8273735Z contiguous: bool, 2025-05-07T20:32:05.8273828Z compiled: bool, 2025-05-07T20:32:05.8273906Z ) -> None: 2025-05-07T20:32:05.8274007Z torch.manual_seed(2025) 2025-05-07T20:32:05.8274081Z 2025-05-07T20:32:05.8274248Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8274329Z 2025-05-07T20:32:05.8274420Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8274543Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8274637Z x = x_sign * x_clamp 2025-05-07T20:32:05.8274799Z x0 = x[:, :D] 2025-05-07T20:32:05.8274887Z x1 = x[:, D:] 2025-05-07T20:32:05.8274959Z 2025-05-07T20:32:05.8275042Z if contiguous: 2025-05-07T20:32:05.8275137Z x0 = x0.contiguous() 2025-05-07T20:32:05.8275222Z x1 = x1.contiguous() 2025-05-07T20:32:05.8275291Z 2025-05-07T20:32:05.8275385Z if scale_ub is not None: 2025-05-07T20:32:05.8275489Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8275625Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8275709Z ) 2025-05-07T20:32:05.8275786Z else: 2025-05-07T20:32:05.8275878Z scale_ub_tensor = None 2025-05-07T20:32:05.8275957Z 2025-05-07T20:32:05.8276084Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8276179Z op = silu_mul_quant 2025-05-07T20:32:05.8276265Z if compiled: 2025-05-07T20:32:05.8276367Z op = torch.compile(op) 2025-05-07T20:32:05.8276479Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8276551Z 2025-05-07T20:32:05.8276641Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8276645Z 2025-05-07T20:32:05.8276748Z moe/activation_test.py:117: 2025-05-07T20:32:05.8276878Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8276976Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8277076Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8277491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.8277588Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.8278077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8278171Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8278534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8278751Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8279086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8279187Z kernel = self.compile( 2025-05-07T20:32:05.8279563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8279783Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8279907Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8279911Z 2025-05-07T20:32:05.8280113Z self = 2025-05-07T20:32:05.8280890Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8281387Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9883e20>} 2025-05-07T20:32:05.8282130Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8282321Z context = 2025-05-07T20:32:05.8282326Z 2025-05-07T20:32:05.8282491Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8282751Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8282859Z module_map=module_map) 2025-05-07T20:32:05.8283026Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8283199Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8283280Z E ^ 2025-05-07T20:32:05.8283634Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8283639Z 2025-05-07T20:32:05.8284047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8284052Z 2025-05-07T20:32:05.8284163Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8284381Z self=, 2025-05-07T20:32:05.8284464Z T=16384, 2025-05-07T20:32:05.8284547Z D=5120, 2025-05-07T20:32:05.8284631Z scale_ub=1200.0, 2025-05-07T20:32:05.8284714Z contiguous=False, 2025-05-07T20:32:05.8284807Z compiled=False, 2025-05-07T20:32:05.8284882Z ) 2025-05-07T20:32:05.8285095Z self = 2025-05-07T20:32:05.8285282Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:05.8285287Z 2025-05-07T20:32:05.8285363Z @given( 2025-05-07T20:32:05.8285485Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8285582Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8285696Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8285818Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8285974Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8286049Z ) 2025-05-07T20:32:05.8286295Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8286387Z def test_silu_mul_quant( 2025-05-07T20:32:05.8286470Z self, 2025-05-07T20:32:05.8286545Z T: int, 2025-05-07T20:32:05.8286622Z D: int, 2025-05-07T20:32:05.8286720Z scale_ub: Optional[float], 2025-05-07T20:32:05.8286806Z contiguous: bool, 2025-05-07T20:32:05.8286893Z compiled: bool, 2025-05-07T20:32:05.8286975Z ) -> None: 2025-05-07T20:32:05.8287069Z torch.manual_seed(2025) 2025-05-07T20:32:05.8287139Z 2025-05-07T20:32:05.8287310Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8287384Z 2025-05-07T20:32:05.8287472Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8287648Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8287783Z x = x_sign * x_clamp 2025-05-07T20:32:05.8287859Z x0 = x[:, :D] 2025-05-07T20:32:05.8287943Z x1 = x[:, D:] 2025-05-07T20:32:05.8288014Z 2025-05-07T20:32:05.8288103Z if contiguous: 2025-05-07T20:32:05.8288193Z x0 = x0.contiguous() 2025-05-07T20:32:05.8288283Z x1 = x1.contiguous() 2025-05-07T20:32:05.8288361Z 2025-05-07T20:32:05.8288449Z if scale_ub is not None: 2025-05-07T20:32:05.8288553Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8288693Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8288767Z ) 2025-05-07T20:32:05.8288842Z else: 2025-05-07T20:32:05.8288940Z scale_ub_tensor = None 2025-05-07T20:32:05.8289010Z 2025-05-07T20:32:05.8289138Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8289232Z op = silu_mul_quant 2025-05-07T20:32:05.8289316Z if compiled: 2025-05-07T20:32:05.8289421Z op = torch.compile(op) 2025-05-07T20:32:05.8289531Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8289601Z 2025-05-07T20:32:05.8289697Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8289701Z 2025-05-07T20:32:05.8289796Z moe/activation_test.py:117: 2025-05-07T20:32:05.8289926Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8290032Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8290128Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8290701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:05.8290802Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8291157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8291377Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8291719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8291813Z kernel = self.compile( 2025-05-07T20:32:05.8295745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8295939Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8296067Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8296080Z 2025-05-07T20:32:05.8296286Z self = 2025-05-07T20:32:05.8297064Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8297569Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a941cd60>} 2025-05-07T20:32:05.8298385Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8298576Z context = 2025-05-07T20:32:05.8298581Z 2025-05-07T20:32:05.8298752Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8299014Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8299121Z module_map=module_map) 2025-05-07T20:32:05.8299284Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8299384Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8299461Z E ^ 2025-05-07T20:32:05.8299818Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8299864Z 2025-05-07T20:32:05.8300278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8300282Z 2025-05-07T20:32:05.8300389Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8300609Z self=, 2025-05-07T20:32:05.8300691Z T=16384, 2025-05-07T20:32:05.8300772Z D=5120, 2025-05-07T20:32:05.8300857Z scale_ub=1200.0, 2025-05-07T20:32:05.8300945Z contiguous=True, 2025-05-07T20:32:05.8301032Z compiled=True, 2025-05-07T20:32:05.8301107Z ) 2025-05-07T20:32:05.8301323Z self = 2025-05-07T20:32:05.8301501Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.8301505Z 2025-05-07T20:32:05.8301587Z @given( 2025-05-07T20:32:05.8301708Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8301807Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8301921Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8302039Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8302154Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8302231Z ) 2025-05-07T20:32:05.8302553Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8302649Z def test_silu_mul_quant( 2025-05-07T20:32:05.8302730Z self, 2025-05-07T20:32:05.8302809Z T: int, 2025-05-07T20:32:05.8302887Z D: int, 2025-05-07T20:32:05.8302991Z scale_ub: Optional[float], 2025-05-07T20:32:05.8303085Z contiguous: bool, 2025-05-07T20:32:05.8303171Z compiled: bool, 2025-05-07T20:32:05.8303254Z ) -> None: 2025-05-07T20:32:05.8303349Z torch.manual_seed(2025) 2025-05-07T20:32:05.8303428Z 2025-05-07T20:32:05.8303597Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8303671Z 2025-05-07T20:32:05.8303763Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8303890Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8303979Z x = x_sign * x_clamp 2025-05-07T20:32:05.8304061Z x0 = x[:, :D] 2025-05-07T20:32:05.8304146Z x1 = x[:, D:] 2025-05-07T20:32:05.8304220Z 2025-05-07T20:32:05.8304315Z if contiguous: 2025-05-07T20:32:05.8304408Z x0 = x0.contiguous() 2025-05-07T20:32:05.8304501Z x1 = x1.contiguous() 2025-05-07T20:32:05.8304577Z 2025-05-07T20:32:05.8304667Z if scale_ub is not None: 2025-05-07T20:32:05.8304773Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8304913Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8304989Z ) 2025-05-07T20:32:05.8305138Z else: 2025-05-07T20:32:05.8305235Z scale_ub_tensor = None 2025-05-07T20:32:05.8305307Z 2025-05-07T20:32:05.8305435Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8305529Z op = silu_mul_quant 2025-05-07T20:32:05.8305947Z if compiled: 2025-05-07T20:32:05.8306094Z op = torch.compile(op) 2025-05-07T20:32:05.8306201Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8306272Z 2025-05-07T20:32:05.8306371Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8306376Z 2025-05-07T20:32:05.8306469Z moe/activation_test.py:117: 2025-05-07T20:32:05.8306596Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8306697Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8306793Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8307154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.8307348Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.8307836Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x…>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function … at 0x…>, 'min_dot_size': <function min_dot_size.<locals>.<lambda> at 0x7f39a941e200>}
module_map = {'triton.language.extra.libdevice': <module …>}
context = <…>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:05.8314603Z Trying example: test_silu_mul_quant(
    self=<…>,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)

self = <…>
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(… the Triton compile frames and CompilationError are identical to the traceback above …)
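For orientation while reading these tracebacks: judging by the test body and the kernel name, silu_mul_quant computes a SiLU-gated product, silu(x0) * x1, and quantizes the result to FP8, returning the quantized tensor and its scale. A minimal eager-mode sketch follows, assuming per-tensor max-based scaling and torch.float8_e4m3fn output (the dtype Triton calls fp8e4nv); the name silu_mul_quant_ref and the exact scaling recipe are illustrative assumptions, not FBGEMM's implementation.

# Hedged sketch of the believed semantics; not FBGEMM's actual kernel.
from typing import Optional, Tuple

import torch

FP8_DTYPE = torch.float8_e4m3fn  # Triton's "fp8e4nv"
FP8_MAX = torch.finfo(FP8_DTYPE).max


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Gated activation, computed in fp32 for accuracy.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # Per-tensor scale from the observed max, optionally capped by scale_ub
    # (assumed meaning of the scale_ub_tensor argument in the test).
    amax = y.abs().amax()
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub.to(amax.dtype))
    scale = (amax / FP8_MAX).clamp(min=1e-12)
    y_fp8 = (y / scale).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE)
    return y_fp8, scale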
2025-05-07T20:32:05.8327450Z Hypothesis keeps generating examples, and every one fails the same way: the Triton compile of _fbgemm_silu_mul_quant raises CompilationError / ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") from triton/compiler/compiler.py:100 before the kernel ever launches. The printed test source and traceback are identical to the ones above for each example (for compiled=False examples the torch/_dynamo/eval_frame.py frame is simply absent, so the failure is independent of torch.compile). Rather than repeat them verbatim, the examples tried are summarized here:

Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False) -> CompilationError
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> CompilationError
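Root cause of all the CompilationErrors above: Triton's fp8e4nv is the FP8 e4m3 format (torch.float8_e4m3fn), which Triton only enables on NVIDIA GPUs with compute capability 8.9 (Ada) / 9.0 (Hopper) or newer; on older parts it advertises exactly the two fallback formats named in the error, fp8e4b15 and fp8e5. So this is an environment/test-selection problem, not a kernel bug: the GPU on this runner predates native FP8 e4m3 support. A guard along the following lines would skip these cases instead of failing them. This is a sketch only; fp8_e4m3_supported and the decorator placement are hypothetical, not FBGEMM's existing test plumbing.

import unittest

import torch


def fp8_e4m3_supported() -> bool:
    # fp8e4nv (float8_e4m3fn) needs SM 8.9+; older GPUs only get
    # fp8e4b15 / fp8e5, which is exactly the ValueError seen above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical placement: guard the whole test class (or individual tests).
@unittest.skipIf(not fp8_e4m3_supported(), "FP8 e4m3 requires SM 8.9+")
class ActivationTests(unittest.TestCase):
    ...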
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8464006Z 2025-05-07T20:32:05.8464411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8464423Z 2025-05-07T20:32:05.8464524Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8464742Z self=, 2025-05-07T20:32:05.8464827Z T=16384, 2025-05-07T20:32:05.8464950Z D=5120, 2025-05-07T20:32:05.8465035Z scale_ub=None, 2025-05-07T20:32:05.8465119Z contiguous=False, 2025-05-07T20:32:05.8465203Z compiled=False, 2025-05-07T20:32:05.8465275Z ) 2025-05-07T20:32:05.8465491Z self = 2025-05-07T20:32:05.8465664Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:05.8465711Z 2025-05-07T20:32:05.8465786Z @given( 2025-05-07T20:32:05.8465904Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8466003Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8466119Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8466233Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8466344Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8466423Z ) 2025-05-07T20:32:05.8466668Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8466760Z def test_silu_mul_quant( 2025-05-07T20:32:05.8466842Z self, 2025-05-07T20:32:05.8466914Z T: int, 2025-05-07T20:32:05.8466987Z D: int, 2025-05-07T20:32:05.8467088Z scale_ub: Optional[float], 2025-05-07T20:32:05.8467176Z contiguous: bool, 2025-05-07T20:32:05.8467265Z compiled: bool, 2025-05-07T20:32:05.8467343Z ) -> None: 2025-05-07T20:32:05.8467438Z torch.manual_seed(2025) 2025-05-07T20:32:05.8467558Z 2025-05-07T20:32:05.8467724Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8467797Z 2025-05-07T20:32:05.8467892Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8468014Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8469819Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8469825Z 2025-05-07T20:32:05.8469939Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:05.8469949Z 2025-05-07T20:32:05.8470050Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8470270Z self=, 2025-05-07T20:32:05.8470349Z T=4096, 2025-05-07T20:32:05.8470427Z D=7168, 2025-05-07T20:32:05.8470510Z scale_ub=1200.0, 2025-05-07T20:32:05.8470594Z contiguous=True, 2025-05-07T20:32:05.8470677Z compiled=True, 2025-05-07T20:32:05.8470750Z ) 2025-05-07T20:32:05.8471008Z self = 2025-05-07T20:32:05.8471180Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.8471185Z 2025-05-07T20:32:05.8471261Z @given( 2025-05-07T20:32:05.8471375Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8471477Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8471589Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8471713Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8471823Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8471896Z ) 2025-05-07T20:32:05.8472138Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8472230Z def test_silu_mul_quant( 2025-05-07T20:32:05.8472305Z self, 2025-05-07T20:32:05.8472386Z T: int, 2025-05-07T20:32:05.8472461Z D: int, 2025-05-07T20:32:05.8472559Z scale_ub: Optional[float], 2025-05-07T20:32:05.8472696Z contiguous: bool, 2025-05-07T20:32:05.8472781Z compiled: bool, 2025-05-07T20:32:05.8472859Z ) -> None: 2025-05-07T20:32:05.8472954Z torch.manual_seed(2025) 2025-05-07T20:32:05.8473027Z 2025-05-07T20:32:05.8473195Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8473268Z 2025-05-07T20:32:05.8473363Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8473527Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8475311Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8475317Z 2025-05-07T20:32:05.8475436Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:05.8475441Z 2025-05-07T20:32:05.8475542Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8475757Z self=, 2025-05-07T20:32:05.8475836Z T=16384, 2025-05-07T20:32:05.8475954Z D=7168, 2025-05-07T20:32:05.8476035Z scale_ub=None, 2025-05-07T20:32:05.8476120Z contiguous=False, 2025-05-07T20:32:05.8476201Z compiled=False, 2025-05-07T20:32:05.8476276Z ) 2025-05-07T20:32:05.8476489Z self = 2025-05-07T20:32:05.8476658Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:05.8476662Z 2025-05-07T20:32:05.8476739Z @given( 2025-05-07T20:32:05.8476855Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8476953Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8477069Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8477182Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8477291Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8477368Z ) 2025-05-07T20:32:05.8477606Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8477705Z def test_silu_mul_quant( 2025-05-07T20:32:05.8477782Z self, 2025-05-07T20:32:05.8477859Z T: int, 2025-05-07T20:32:05.8477935Z D: int, 2025-05-07T20:32:05.8478032Z scale_ub: Optional[float], 2025-05-07T20:32:05.8478118Z contiguous: bool, 2025-05-07T20:32:05.8478204Z compiled: bool, 2025-05-07T20:32:05.8478280Z ) -> None: 2025-05-07T20:32:05.8478368Z torch.manual_seed(2025) 2025-05-07T20:32:05.8478441Z 2025-05-07T20:32:05.8478649Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8480432Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8480443Z 2025-05-07T20:32:05.8480558Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8480562Z 2025-05-07T20:32:05.8480668Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8480883Z self=, 2025-05-07T20:32:05.8480961Z T=2048, 2025-05-07T20:32:05.8481040Z D=7168, 2025-05-07T20:32:05.8481167Z scale_ub=1200.0, 2025-05-07T20:32:05.8481252Z contiguous=True, 2025-05-07T20:32:05.8481337Z compiled=True, 2025-05-07T20:32:05.8481410Z ) 2025-05-07T20:32:05.8481624Z self = 2025-05-07T20:32:05.8481794Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.8481799Z 2025-05-07T20:32:05.8481916Z @given( 2025-05-07T20:32:05.8482035Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8482131Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8482243Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8482358Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8482466Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8482538Z ) 2025-05-07T20:32:05.8482783Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8482876Z def test_silu_mul_quant( 2025-05-07T20:32:05.8482958Z self, 2025-05-07T20:32:05.8483036Z T: int, 2025-05-07T20:32:05.8483117Z D: int, 2025-05-07T20:32:05.8483220Z scale_ub: Optional[float], 2025-05-07T20:32:05.8483307Z contiguous: bool, 2025-05-07T20:32:05.8483392Z compiled: bool, 2025-05-07T20:32:05.8483480Z ) -> None: 2025-05-07T20:32:05.8483576Z torch.manual_seed(2025) 2025-05-07T20:32:05.8483697Z 2025-05-07T20:32:05.8483868Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8483940Z 2025-05-07T20:32:05.8484035Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8484157Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8485921Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8485933Z 2025-05-07T20:32:05.8486047Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:05.8486056Z 2025-05-07T20:32:05.8486158Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8486381Z self=, 2025-05-07T20:32:05.8486457Z T=2048, 2025-05-07T20:32:05.8486529Z D=7168, 2025-05-07T20:32:05.8486614Z scale_ub=None, 2025-05-07T20:32:05.8486694Z contiguous=True, 2025-05-07T20:32:05.8486775Z compiled=False, 2025-05-07T20:32:05.8486854Z ) 2025-05-07T20:32:05.8487109Z self = 2025-05-07T20:32:05.8487287Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:05.8487292Z 2025-05-07T20:32:05.8487368Z @given( 2025-05-07T20:32:05.8487486Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8487637Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8487750Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8487863Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8487987Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8488059Z ) 2025-05-07T20:32:05.8488301Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8488400Z def test_silu_mul_quant( 2025-05-07T20:32:05.8488473Z self, 2025-05-07T20:32:05.8488554Z T: int, 2025-05-07T20:32:05.8488633Z D: int, 2025-05-07T20:32:05.8488728Z scale_ub: Optional[float], 2025-05-07T20:32:05.8488823Z contiguous: bool, 2025-05-07T20:32:05.8488955Z compiled: bool, 2025-05-07T20:32:05.8489036Z ) -> None: 2025-05-07T20:32:05.8489133Z torch.manual_seed(2025) 2025-05-07T20:32:05.8489202Z 2025-05-07T20:32:05.8489367Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8489444Z 2025-05-07T20:32:05.8489533Z > x_sign = torch.sign(x) 2025-05-07T20:32:05.8491293Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8491345Z 2025-05-07T20:32:05.8491464Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:05.8491468Z 2025-05-07T20:32:05.8491576Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8491794Z self=, 2025-05-07T20:32:05.8491871Z T=1, 2025-05-07T20:32:05.8491949Z D=7168, 2025-05-07T20:32:05.8492030Z scale_ub=1200.0, 2025-05-07T20:32:05.8492112Z contiguous=True, 2025-05-07T20:32:05.8492267Z compiled=False, 2025-05-07T20:32:05.8492341Z ) 2025-05-07T20:32:05.8492554Z self = 2025-05-07T20:32:05.8492721Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.8492726Z 2025-05-07T20:32:05.8492800Z @given( 2025-05-07T20:32:05.8492920Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8493019Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8493137Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8493256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8493367Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8493441Z ) 2025-05-07T20:32:05.8493687Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8493783Z def test_silu_mul_quant( 2025-05-07T20:32:05.8493862Z self, 2025-05-07T20:32:05.8493949Z T: int, 2025-05-07T20:32:05.8494025Z D: int, 2025-05-07T20:32:05.8494118Z scale_ub: Optional[float], 2025-05-07T20:32:05.8494214Z contiguous: bool, 2025-05-07T20:32:05.8494297Z compiled: bool, 2025-05-07T20:32:05.8494377Z ) -> None: 2025-05-07T20:32:05.8494470Z torch.manual_seed(2025) 2025-05-07T20:32:05.8494545Z 2025-05-07T20:32:05.8494711Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8494785Z 2025-05-07T20:32:05.8494924Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8495051Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8495139Z x = x_sign * x_clamp 2025-05-07T20:32:05.8495218Z x0 = x[:, :D] 2025-05-07T20:32:05.8495300Z x1 = x[:, D:] 2025-05-07T20:32:05.8495374Z 2025-05-07T20:32:05.8495459Z if contiguous: 2025-05-07T20:32:05.8495554Z x0 = x0.contiguous() 2025-05-07T20:32:05.8495645Z x1 = x1.contiguous() 2025-05-07T20:32:05.8495723Z 2025-05-07T20:32:05.8495820Z if scale_ub is not None: 2025-05-07T20:32:05.8495925Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8496060Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8496135Z ) 2025-05-07T20:32:05.8496213Z else: 2025-05-07T20:32:05.8496315Z scale_ub_tensor = None 2025-05-07T20:32:05.8496388Z 2025-05-07T20:32:05.8496515Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8496661Z op = silu_mul_quant 2025-05-07T20:32:05.8496749Z if compiled: 2025-05-07T20:32:05.8496846Z op = torch.compile(op) 2025-05-07T20:32:05.8496957Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8497029Z 2025-05-07T20:32:05.8497118Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8497129Z 2025-05-07T20:32:05.8497224Z moe/activation_test.py:117: 2025-05-07T20:32:05.8497354Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8497496Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8497593Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8498090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8498192Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8498551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8498779Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8499117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8499213Z kernel = self.compile( 2025-05-07T20:32:05.8499600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8499817Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8499943Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8499948Z 2025-05-07T20:32:05.8500156Z self = 2025-05-07T20:32:05.8500931Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8501434Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a8a01440>} 2025-05-07T20:32:05.8502177Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8502377Z context = 2025-05-07T20:32:05.8502381Z 2025-05-07T20:32:05.8502544Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8502806Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8502920Z module_map=module_map) 2025-05-07T20:32:05.8503119Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8503220Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8503304Z E ^ 2025-05-07T20:32:05.8503658Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8503663Z 2025-05-07T20:32:05.8504078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8504085Z 2025-05-07T20:32:05.8504188Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8504408Z self=, 2025-05-07T20:32:05.8504494Z T=128, 2025-05-07T20:32:05.8504571Z D=5120, 2025-05-07T20:32:05.8504652Z scale_ub=None, 2025-05-07T20:32:05.8504743Z contiguous=True, 2025-05-07T20:32:05.8504828Z compiled=False, 2025-05-07T20:32:05.8504905Z ) 2025-05-07T20:32:05.8505118Z self = 2025-05-07T20:32:05.8505404Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:05.8505409Z 2025-05-07T20:32:05.8505493Z @given( 2025-05-07T20:32:05.8505903Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8506039Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8506160Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8506274Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8506474Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8506552Z ) 2025-05-07T20:32:05.8506794Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8506892Z def test_silu_mul_quant( 2025-05-07T20:32:05.8506967Z self, 2025-05-07T20:32:05.8507043Z T: int, 2025-05-07T20:32:05.8507121Z D: int, 2025-05-07T20:32:05.8507219Z scale_ub: Optional[float], 2025-05-07T20:32:05.8507308Z contiguous: bool, 2025-05-07T20:32:05.8507403Z compiled: bool, 2025-05-07T20:32:05.8507482Z ) -> None: 2025-05-07T20:32:05.8507574Z torch.manual_seed(2025) 2025-05-07T20:32:05.8507648Z 2025-05-07T20:32:05.8507812Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8507885Z 2025-05-07T20:32:05.8507980Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8508101Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8508262Z x = x_sign * x_clamp 2025-05-07T20:32:05.8508341Z x0 = x[:, :D] 2025-05-07T20:32:05.8508421Z x1 = x[:, D:] 2025-05-07T20:32:05.8508499Z 2025-05-07T20:32:05.8508581Z if contiguous: 2025-05-07T20:32:05.8508671Z x0 = x0.contiguous() 2025-05-07T20:32:05.8508763Z x1 = x1.contiguous() 2025-05-07T20:32:05.8508834Z 2025-05-07T20:32:05.8508924Z if scale_ub is not None: 2025-05-07T20:32:05.8509036Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8509173Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8509249Z ) 2025-05-07T20:32:05.8509333Z else: 2025-05-07T20:32:05.8509426Z scale_ub_tensor = None 2025-05-07T20:32:05.8509505Z 2025-05-07T20:32:05.8509632Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8509720Z op = silu_mul_quant 2025-05-07T20:32:05.8509808Z if compiled: 2025-05-07T20:32:05.8509913Z op = torch.compile(op) 2025-05-07T20:32:05.8510020Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8510100Z 2025-05-07T20:32:05.8510187Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8510192Z 2025-05-07T20:32:05.8510289Z moe/activation_test.py:117: 2025-05-07T20:32:05.8510421Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8510520Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8510617Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8511183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8511283Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8511642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8511859Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8512202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8512300Z kernel = self.compile( 2025-05-07T20:32:05.8512677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8512856Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8512982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8512990Z 2025-05-07T20:32:05.8513258Z self = 2025-05-07T20:32:05.8514038Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8514536Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a8a02660>} 2025-05-07T20:32:05.8515329Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8515517Z context = 2025-05-07T20:32:05.8515521Z 2025-05-07T20:32:05.8515687Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8515952Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8516058Z module_map=module_map) 2025-05-07T20:32:05.8516222Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8516320Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8516396Z E ^ 2025-05-07T20:32:05.8516794Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8516799Z 2025-05-07T20:32:05.8517256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8517261Z 2025-05-07T20:32:05.8517368Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8517587Z self=, 2025-05-07T20:32:05.8517667Z T=128, 2025-05-07T20:32:05.8517754Z D=7168, 2025-05-07T20:32:05.8517836Z scale_ub=None, 2025-05-07T20:32:05.8517920Z contiguous=True, 2025-05-07T20:32:05.8518008Z compiled=False, 2025-05-07T20:32:05.8518081Z ) 2025-05-07T20:32:05.8518296Z self = 2025-05-07T20:32:05.8518468Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:05.8518473Z 2025-05-07T20:32:05.8518552Z @given( 2025-05-07T20:32:05.8518674Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8518771Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8518882Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8519001Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8519117Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8519189Z ) 2025-05-07T20:32:05.8519476Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8519573Z def test_silu_mul_quant( 2025-05-07T20:32:05.8519646Z self, 2025-05-07T20:32:05.8519722Z T: int, 2025-05-07T20:32:05.8519797Z D: int, 2025-05-07T20:32:05.8519899Z scale_ub: Optional[float], 2025-05-07T20:32:05.8519986Z contiguous: bool, 2025-05-07T20:32:05.8520071Z compiled: bool, 2025-05-07T20:32:05.8520151Z ) -> None: 2025-05-07T20:32:05.8520243Z torch.manual_seed(2025) 2025-05-07T20:32:05.8520319Z 2025-05-07T20:32:05.8520490Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8520563Z 2025-05-07T20:32:05.8520652Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8520781Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8520868Z x = x_sign * x_clamp 2025-05-07T20:32:05.8520949Z x0 = x[:, :D] 2025-05-07T20:32:05.8521033Z x1 = x[:, D:] 2025-05-07T20:32:05.8521103Z 2025-05-07T20:32:05.8521193Z if contiguous: 2025-05-07T20:32:05.8521343Z x0 = x0.contiguous() 2025-05-07T20:32:05.8521434Z x1 = x1.contiguous() 2025-05-07T20:32:05.8521511Z 2025-05-07T20:32:05.8521604Z if scale_ub is not None: 2025-05-07T20:32:05.8521709Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8521848Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8521923Z ) 2025-05-07T20:32:05.8522062Z else: 2025-05-07T20:32:05.8522165Z scale_ub_tensor = None 2025-05-07T20:32:05.8522237Z 2025-05-07T20:32:05.8522364Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8522460Z op = silu_mul_quant 2025-05-07T20:32:05.8522543Z if compiled: 2025-05-07T20:32:05.8522640Z op = torch.compile(op) 2025-05-07T20:32:05.8522751Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8522824Z 2025-05-07T20:32:05.8522918Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8522925Z 2025-05-07T20:32:05.8523022Z moe/activation_test.py:117: 2025-05-07T20:32:05.8523150Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8523254Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8523347Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8523841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8523985Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8524338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8524561Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8524900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8524996Z kernel = self.compile( 2025-05-07T20:32:05.8525382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8525554Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8525680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8525690Z 2025-05-07T20:32:05.8525891Z self = 2025-05-07T20:32:05.8526667Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8527175Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a8a036a0>} 2025-05-07T20:32:05.8528017Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8528212Z context = 2025-05-07T20:32:05.8528216Z 2025-05-07T20:32:05.8528377Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8528636Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8528752Z module_map=module_map) 2025-05-07T20:32:05.8528910Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8529013Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8529089Z E ^ 2025-05-07T20:32:05.8529440Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8529445Z 2025-05-07T20:32:05.8529905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8529910Z 2025-05-07T20:32:05.8530014Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8530233Z self=, 2025-05-07T20:32:05.8530318Z T=2048, 2025-05-07T20:32:05.8530393Z D=7168, 2025-05-07T20:32:05.8530479Z scale_ub=1200.0, 2025-05-07T20:32:05.8530563Z contiguous=True, 2025-05-07T20:32:05.8530689Z compiled=False, 2025-05-07T20:32:05.8530769Z ) 2025-05-07T20:32:05.8530986Z self = 2025-05-07T20:32:05.8531158Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.8531163Z 2025-05-07T20:32:05.8531244Z @given( 2025-05-07T20:32:05.8531361Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8531460Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8531579Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8531697Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8531813Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8531889Z ) 2025-05-07T20:32:05.8532130Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8532225Z def test_silu_mul_quant( 2025-05-07T20:32:05.8532298Z self, 2025-05-07T20:32:05.8532418Z T: int, 2025-05-07T20:32:05.8532498Z D: int, 2025-05-07T20:32:05.8532595Z scale_ub: Optional[float], 2025-05-07T20:32:05.8532684Z contiguous: bool, 2025-05-07T20:32:05.8532770Z compiled: bool, 2025-05-07T20:32:05.8532847Z ) -> None: 2025-05-07T20:32:05.8532938Z torch.manual_seed(2025) 2025-05-07T20:32:05.8533012Z 2025-05-07T20:32:05.8533179Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8534964Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8534975Z 2025-05-07T20:32:05.8535092Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8535096Z 2025-05-07T20:32:05.8535198Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8535416Z self=, 2025-05-07T20:32:05.8535490Z T=1, 2025-05-07T20:32:05.8535572Z D=5120, 2025-05-07T20:32:05.8535652Z scale_ub=1200.0, 2025-05-07T20:32:05.8535777Z contiguous=True, 2025-05-07T20:32:05.8535869Z compiled=False, 2025-05-07T20:32:05.8535938Z ) 2025-05-07T20:32:05.8536152Z self = 2025-05-07T20:32:05.8536318Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.8536323Z 2025-05-07T20:32:05.8536400Z @given( 2025-05-07T20:32:05.8536521Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8536624Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8536735Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8536881Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8537004Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8537090Z ) 2025-05-07T20:32:05.8537335Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8537428Z def test_silu_mul_quant( 2025-05-07T20:32:05.8537499Z self, 2025-05-07T20:32:05.8537582Z T: int, 2025-05-07T20:32:05.8537698Z D: int, 2025-05-07T20:32:05.8537797Z scale_ub: Optional[float], 2025-05-07T20:32:05.8537886Z contiguous: bool, 2025-05-07T20:32:05.8537967Z compiled: bool, 2025-05-07T20:32:05.8538047Z ) -> None: 2025-05-07T20:32:05.8538141Z torch.manual_seed(2025) 2025-05-07T20:32:05.8538211Z 2025-05-07T20:32:05.8538381Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8538499Z 2025-05-07T20:32:05.8538588Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8538718Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8538805Z x = x_sign * x_clamp 2025-05-07T20:32:05.8538884Z x0 = x[:, :D] 2025-05-07T20:32:05.8538969Z x1 = x[:, D:] 2025-05-07T20:32:05.8539041Z 2025-05-07T20:32:05.8539127Z if contiguous: 2025-05-07T20:32:05.8539217Z x0 = x0.contiguous() 2025-05-07T20:32:05.8539307Z x1 = x1.contiguous() 2025-05-07T20:32:05.8539385Z 2025-05-07T20:32:05.8539474Z if scale_ub is not None: 2025-05-07T20:32:05.8539578Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8539715Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8539791Z ) 2025-05-07T20:32:05.8539868Z else: 2025-05-07T20:32:05.8539967Z scale_ub_tensor = None 2025-05-07T20:32:05.8540040Z 2025-05-07T20:32:05.8540212Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8540301Z op = silu_mul_quant 2025-05-07T20:32:05.8540383Z if compiled: 2025-05-07T20:32:05.8540482Z op = torch.compile(op) 2025-05-07T20:32:05.8540587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8540659Z 2025-05-07T20:32:05.8540758Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8540763Z 2025-05-07T20:32:05.8540854Z moe/activation_test.py:117: 2025-05-07T20:32:05.8544582Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8544703Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8544802Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8545312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8545410Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8545768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8545999Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8546340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8546439Z kernel = self.compile( 2025-05-07T20:32:05.8546884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8547062Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8547192Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8547197Z 2025-05-07T20:32:05.8547399Z self = 2025-05-07T20:32:05.8548172Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8548685Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a8c90b80>} 2025-05-07T20:32:05.8549475Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8549668Z context = 2025-05-07T20:32:05.8549673Z 2025-05-07T20:32:05.8549835Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8550100Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8550206Z module_map=module_map) 2025-05-07T20:32:05.8550409Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8550510Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8550588Z E ^ 2025-05-07T20:32:05.8550939Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8550943Z 2025-05-07T20:32:05.8551358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8551362Z 2025-05-07T20:32:05.8551470Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8551691Z self=, 2025-05-07T20:32:05.8551769Z T=2048, 2025-05-07T20:32:05.8551843Z D=5120, 2025-05-07T20:32:05.8551931Z scale_ub=None, 2025-05-07T20:32:05.8552015Z contiguous=True, 2025-05-07T20:32:05.8552099Z compiled=False, 2025-05-07T20:32:05.8552176Z ) 2025-05-07T20:32:05.8552392Z self = 2025-05-07T20:32:05.8552626Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:05.8552630Z 2025-05-07T20:32:05.8552707Z @given( 2025-05-07T20:32:05.8552826Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8552929Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8553043Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8553159Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8553276Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8553350Z ) 2025-05-07T20:32:05.8553595Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8553694Z def test_silu_mul_quant( 2025-05-07T20:32:05.8553770Z self, 2025-05-07T20:32:05.8553849Z T: int, 2025-05-07T20:32:05.8553925Z D: int, 2025-05-07T20:32:05.8554023Z scale_ub: Optional[float], 2025-05-07T20:32:05.8554122Z contiguous: bool, 2025-05-07T20:32:05.8554207Z compiled: bool, 2025-05-07T20:32:05.8554286Z ) -> None: 2025-05-07T20:32:05.8554384Z torch.manual_seed(2025) 2025-05-07T20:32:05.8554457Z 2025-05-07T20:32:05.8554625Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8554702Z 2025-05-07T20:32:05.8554794Z > x_sign = torch.sign(x) 2025-05-07T20:32:05.8556624Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8556635Z 2025-05-07T20:32:05.8556754Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:05.8556758Z 2025-05-07T20:32:05.8556861Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8557083Z self=, 2025-05-07T20:32:05.8557161Z T=16384, 2025-05-07T20:32:05.8557240Z D=5120, 2025-05-07T20:32:05.8557323Z scale_ub=None, 2025-05-07T20:32:05.8557407Z contiguous=True, 2025-05-07T20:32:05.8557497Z compiled=False, 2025-05-07T20:32:05.8557635Z ) 2025-05-07T20:32:05.8557851Z self = 2025-05-07T20:32:05.8558027Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:05.8558031Z 2025-05-07T20:32:05.8558108Z @given( 2025-05-07T20:32:05.8558225Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8558324Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8558479Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8558596Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8558708Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8558782Z ) 2025-05-07T20:32:05.8559024Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8559117Z def test_silu_mul_quant( 2025-05-07T20:32:05.8559192Z self, 2025-05-07T20:32:05.8559272Z T: int, 2025-05-07T20:32:05.8559351Z D: int, 2025-05-07T20:32:05.8559448Z scale_ub: Optional[float], 2025-05-07T20:32:05.8559540Z contiguous: bool, 2025-05-07T20:32:05.8559624Z compiled: bool, 2025-05-07T20:32:05.8559703Z ) -> None: 2025-05-07T20:32:05.8559801Z torch.manual_seed(2025) 2025-05-07T20:32:05.8559874Z 2025-05-07T20:32:05.8560041Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8561861Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8561869Z 2025-05-07T20:32:05.8561988Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8561993Z 2025-05-07T20:32:05.8562095Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8562315Z self=, 2025-05-07T20:32:05.8562395Z T=4096, 2025-05-07T20:32:05.8562471Z D=5120, 2025-05-07T20:32:05.8562553Z scale_ub=None, 2025-05-07T20:32:05.8562648Z contiguous=True, 2025-05-07T20:32:05.8562733Z compiled=False, 2025-05-07T20:32:05.8562808Z ) 2025-05-07T20:32:05.8563024Z self = 2025-05-07T20:32:05.8563193Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:05.8563197Z 2025-05-07T20:32:05.8563277Z @given( 2025-05-07T20:32:05.8563392Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8563531Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8563650Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8563764Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8563876Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8563951Z ) 2025-05-07T20:32:05.8564191Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8564289Z def test_silu_mul_quant( 2025-05-07T20:32:05.8564372Z self, 2025-05-07T20:32:05.8564447Z T: int, 2025-05-07T20:32:05.8564525Z D: int, 2025-05-07T20:32:05.8564621Z scale_ub: Optional[float], 2025-05-07T20:32:05.8564709Z contiguous: bool, 2025-05-07T20:32:05.8564795Z compiled: bool, 2025-05-07T20:32:05.8564871Z ) -> None: 2025-05-07T20:32:05.8564964Z torch.manual_seed(2025) 2025-05-07T20:32:05.8565040Z 2025-05-07T20:32:05.8565205Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8567019Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8567063Z 2025-05-07T20:32:05.8567183Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8567187Z 2025-05-07T20:32:05.8567289Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8567511Z self=, 2025-05-07T20:32:05.8567668Z T=2048, 2025-05-07T20:32:05.8567747Z D=5120, 2025-05-07T20:32:05.8567829Z scale_ub=None, 2025-05-07T20:32:05.8567921Z contiguous=False, 2025-05-07T20:32:05.8568007Z compiled=False, 2025-05-07T20:32:05.8568078Z ) 2025-05-07T20:32:05.8568293Z self = 2025-05-07T20:32:05.8568466Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:05.8568470Z 2025-05-07T20:32:05.8568548Z @given( 2025-05-07T20:32:05.8568664Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8568812Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8568924Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8569041Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8569153Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8569225Z ) 2025-05-07T20:32:05.8569470Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8569562Z def test_silu_mul_quant( 2025-05-07T20:32:05.8569639Z self, 2025-05-07T20:32:05.8569722Z T: int, 2025-05-07T20:32:05.8569797Z D: int, 2025-05-07T20:32:05.8569896Z scale_ub: Optional[float], 2025-05-07T20:32:05.8569990Z contiguous: bool, 2025-05-07T20:32:05.8570074Z compiled: bool, 2025-05-07T20:32:05.8570151Z ) -> None: 2025-05-07T20:32:05.8570248Z torch.manual_seed(2025) 2025-05-07T20:32:05.8570320Z 2025-05-07T20:32:05.8570487Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8572302Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8572308Z 2025-05-07T20:32:05.8572427Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8572431Z 2025-05-07T20:32:05.8572532Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8572752Z self=, 2025-05-07T20:32:05.8572835Z T=4096, 2025-05-07T20:32:05.8572917Z D=7168, 2025-05-07T20:32:05.8572999Z scale_ub=None, 2025-05-07T20:32:05.8573086Z contiguous=True, 2025-05-07T20:32:05.8573169Z compiled=True, 2025-05-07T20:32:05.8573242Z ) 2025-05-07T20:32:05.8573460Z self = 2025-05-07T20:32:05.8573626Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.8573631Z 2025-05-07T20:32:05.8573710Z @given( 2025-05-07T20:32:05.8573828Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8573966Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8574081Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8574196Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8574306Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8574382Z ) 2025-05-07T20:32:05.8574624Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8574763Z def test_silu_mul_quant( 2025-05-07T20:32:05.8574838Z self, 2025-05-07T20:32:05.8574913Z T: int, 2025-05-07T20:32:05.8574991Z D: int, 2025-05-07T20:32:05.8575088Z scale_ub: Optional[float], 2025-05-07T20:32:05.8575175Z contiguous: bool, 2025-05-07T20:32:05.8575262Z compiled: bool, 2025-05-07T20:32:05.8575339Z ) -> None: 2025-05-07T20:32:05.8575431Z torch.manual_seed(2025) 2025-05-07T20:32:05.8575508Z 2025-05-07T20:32:05.8575680Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8577452Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8577503Z 2025-05-07T20:32:05.8577622Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8577626Z 2025-05-07T20:32:05.8577727Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8577947Z self=, 2025-05-07T20:32:05.8578024Z T=2048, 2025-05-07T20:32:05.8578105Z D=5120, 2025-05-07T20:32:05.8578190Z scale_ub=1200.0, 2025-05-07T20:32:05.8578275Z contiguous=False, 2025-05-07T20:32:05.8578361Z compiled=False, 2025-05-07T20:32:05.8578435Z ) 2025-05-07T20:32:05.8578650Z self = 2025-05-07T20:32:05.8578827Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:05.8578832Z 2025-05-07T20:32:05.8578911Z @given( 2025-05-07T20:32:05.8579031Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8579129Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8579244Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8579358Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8579474Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8579546Z ) 2025-05-07T20:32:05.8579832Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8579933Z def test_silu_mul_quant( 2025-05-07T20:32:05.8580009Z self, 2025-05-07T20:32:05.8580085Z T: int, 2025-05-07T20:32:05.8580163Z D: int, 2025-05-07T20:32:05.8580259Z scale_ub: Optional[float], 2025-05-07T20:32:05.8580345Z contiguous: bool, 2025-05-07T20:32:05.8580432Z compiled: bool, 2025-05-07T20:32:05.8580510Z ) -> None: 2025-05-07T20:32:05.8580603Z torch.manual_seed(2025) 2025-05-07T20:32:05.8580683Z 2025-05-07T20:32:05.8580850Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8582662Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8582669Z 2025-05-07T20:32:05.8582786Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8582791Z 2025-05-07T20:32:05.8582894Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8583112Z self=, 2025-05-07T20:32:05.8583231Z T=4096, 2025-05-07T20:32:05.8583310Z D=7168, 2025-05-07T20:32:05.8583392Z scale_ub=1200.0, 2025-05-07T20:32:05.8583475Z contiguous=True, 2025-05-07T20:32:05.8583560Z compiled=False, 2025-05-07T20:32:05.8583633Z ) 2025-05-07T20:32:05.8583846Z self = 2025-05-07T20:32:05.8584017Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.8584022Z 2025-05-07T20:32:05.8584101Z @given( 2025-05-07T20:32:05.8584222Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8584318Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8584430Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8584545Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8584657Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8584730Z ) 2025-05-07T20:32:05.8584974Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8585112Z def test_silu_mul_quant( 2025-05-07T20:32:05.8585187Z self, 2025-05-07T20:32:05.8585265Z T: int, 2025-05-07T20:32:05.8585341Z D: int, 2025-05-07T20:32:05.8585439Z scale_ub: Optional[float], 2025-05-07T20:32:05.8585528Z contiguous: bool, 2025-05-07T20:32:05.8585613Z compiled: bool, 2025-05-07T20:32:05.8585692Z ) -> None: 2025-05-07T20:32:05.8585788Z torch.manual_seed(2025) 2025-05-07T20:32:05.8585860Z 2025-05-07T20:32:05.8586030Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8587800Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8587812Z 2025-05-07T20:32:05.8587934Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8587938Z 2025-05-07T20:32:05.8588038Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8588322Z self=, 2025-05-07T20:32:05.8588408Z T=16384, 2025-05-07T20:32:05.8588484Z D=7168, 2025-05-07T20:32:05.8588571Z scale_ub=None, 2025-05-07T20:32:05.8588657Z contiguous=False, 2025-05-07T20:32:05.8588740Z compiled=True, 2025-05-07T20:32:05.8588820Z ) 2025-05-07T20:32:05.8589032Z self = 2025-05-07T20:32:05.8589204Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:05.8589212Z 2025-05-07T20:32:05.8589293Z @given( 2025-05-07T20:32:05.8589410Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8589506Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8589620Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8589733Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8589847Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8589919Z ) 2025-05-07T20:32:05.8590202Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8590298Z def test_silu_mul_quant( 2025-05-07T20:32:05.8590373Z self, 2025-05-07T20:32:05.8590449Z T: int, 2025-05-07T20:32:05.8590527Z D: int, 2025-05-07T20:32:05.8590625Z scale_ub: Optional[float], 2025-05-07T20:32:05.8590713Z contiguous: bool, 2025-05-07T20:32:05.8590801Z compiled: bool, 2025-05-07T20:32:05.8590923Z ) -> None: 2025-05-07T20:32:05.8591016Z torch.manual_seed(2025) 2025-05-07T20:32:05.8591092Z 2025-05-07T20:32:05.8591257Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8593032Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8593038Z 2025-05-07T20:32:05.8593155Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8593160Z 2025-05-07T20:32:05.8593262Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8593524Z self=, 2025-05-07T20:32:05.8593603Z T=4096, 2025-05-07T20:32:05.8593683Z D=7168, 2025-05-07T20:32:05.8593765Z scale_ub=None, 2025-05-07T20:32:05.8593849Z contiguous=True, 2025-05-07T20:32:05.8593934Z compiled=False, 2025-05-07T20:32:05.8594006Z ) 2025-05-07T20:32:05.8594220Z self = 2025-05-07T20:32:05.8594394Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:05.8594401Z 2025-05-07T20:32:05.8594478Z @given( 2025-05-07T20:32:05.8594598Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8594694Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8594806Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8594925Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8595036Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8595116Z ) 2025-05-07T20:32:05.8595360Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8595452Z def test_silu_mul_quant( 2025-05-07T20:32:05.8595527Z self, 2025-05-07T20:32:05.8595610Z T: int, 2025-05-07T20:32:05.8595685Z D: int, 2025-05-07T20:32:05.8595786Z scale_ub: Optional[float], 2025-05-07T20:32:05.8595874Z contiguous: bool, 2025-05-07T20:32:05.8595959Z compiled: bool, 2025-05-07T20:32:05.8596084Z ) -> None: 2025-05-07T20:32:05.8596180Z torch.manual_seed(2025) 2025-05-07T20:32:05.8596252Z 2025-05-07T20:32:05.8596417Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8598189Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8598199Z 2025-05-07T20:32:05.8598318Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8598323Z 2025-05-07T20:32:05.8598424Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8598682Z self=, 2025-05-07T20:32:05.8598764Z T=16384, 2025-05-07T20:32:05.8598840Z D=7168, 2025-05-07T20:32:05.8598925Z scale_ub=None, 2025-05-07T20:32:05.8599011Z contiguous=True, 2025-05-07T20:32:05.8599094Z compiled=False, 2025-05-07T20:32:05.8599171Z ) 2025-05-07T20:32:05.8599385Z self = 2025-05-07T20:32:05.8599599Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:05.8599604Z 2025-05-07T20:32:05.8599682Z @given( 2025-05-07T20:32:05.8599797Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8599893Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8600007Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8600122Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8600237Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8600313Z ) 2025-05-07T20:32:05.8600552Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8600646Z def test_silu_mul_quant( 2025-05-07T20:32:05.8600721Z self, 2025-05-07T20:32:05.8600798Z T: int, 2025-05-07T20:32:05.8600876Z D: int, 2025-05-07T20:32:05.8600972Z scale_ub: Optional[float], 2025-05-07T20:32:05.8601061Z contiguous: bool, 2025-05-07T20:32:05.8601193Z compiled: bool, 2025-05-07T20:32:05.8601271Z ) -> None: 2025-05-07T20:32:05.8601364Z torch.manual_seed(2025) 2025-05-07T20:32:05.8601439Z 2025-05-07T20:32:05.8601603Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8603379Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8603385Z 2025-05-07T20:32:05.8603500Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8603510Z 2025-05-07T20:32:05.8603613Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8603831Z self=, 2025-05-07T20:32:05.8603908Z T=16384, 2025-05-07T20:32:05.8603988Z D=7168, 2025-05-07T20:32:05.8604071Z scale_ub=1200.0, 2025-05-07T20:32:05.8604156Z contiguous=True, 2025-05-07T20:32:05.8604242Z compiled=False, 2025-05-07T20:32:05.8604315Z ) 2025-05-07T20:32:05.8604567Z self = 2025-05-07T20:32:05.8604745Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.8604750Z 2025-05-07T20:32:05.8604829Z @given( 2025-05-07T20:32:05.8604947Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8605045Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8605159Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8605277Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8605393Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8605468Z ) 2025-05-07T20:32:05.8605939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8606076Z def test_silu_mul_quant( 2025-05-07T20:32:05.8606154Z self, 2025-05-07T20:32:05.8606234Z T: int, 2025-05-07T20:32:05.8606308Z D: int, 2025-05-07T20:32:05.8606405Z scale_ub: Optional[float], 2025-05-07T20:32:05.8606495Z contiguous: bool, 2025-05-07T20:32:05.8606661Z compiled: bool, 2025-05-07T20:32:05.8606742Z ) -> None: 2025-05-07T20:32:05.8606835Z torch.manual_seed(2025) 2025-05-07T20:32:05.8606924Z 2025-05-07T20:32:05.8607115Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8609021Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8609100Z 2025-05-07T20:32:05.8609218Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8609226Z 2025-05-07T20:32:05.8609328Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8609545Z self=, 2025-05-07T20:32:05.8609630Z T=128, 2025-05-07T20:32:05.8609702Z D=5120, 2025-05-07T20:32:05.8609788Z scale_ub=1200.0, 2025-05-07T20:32:05.8609872Z contiguous=False, 2025-05-07T20:32:05.8609951Z compiled=False, 2025-05-07T20:32:05.8610036Z ) 2025-05-07T20:32:05.8610317Z self = 2025-05-07T20:32:05.8610496Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:05.8610501Z 2025-05-07T20:32:05.8610578Z @given( 2025-05-07T20:32:05.8610693Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8610790Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8610906Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8611021Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8611139Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8611213Z ) 2025-05-07T20:32:05.8611453Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8611548Z def test_silu_mul_quant( 2025-05-07T20:32:05.8611620Z self, 2025-05-07T20:32:05.8611698Z T: int, 2025-05-07T20:32:05.8611772Z D: int, 2025-05-07T20:32:05.8611872Z scale_ub: Optional[float], 2025-05-07T20:32:05.8611966Z contiguous: bool, 2025-05-07T20:32:05.8612049Z compiled: bool, 2025-05-07T20:32:05.8612126Z ) -> None: 2025-05-07T20:32:05.8612222Z torch.manual_seed(2025) 2025-05-07T20:32:05.8612294Z 2025-05-07T20:32:05.8612459Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8612538Z 2025-05-07T20:32:05.8612630Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8612812Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8612908Z x = x_sign * x_clamp 2025-05-07T20:32:05.8612988Z x0 = x[:, :D] 2025-05-07T20:32:05.8613070Z x1 = x[:, D:] 2025-05-07T20:32:05.8613139Z 2025-05-07T20:32:05.8613220Z if contiguous: 2025-05-07T20:32:05.8613313Z x0 = x0.contiguous() 2025-05-07T20:32:05.8613401Z x1 = x1.contiguous() 2025-05-07T20:32:05.8613470Z 2025-05-07T20:32:05.8613562Z if scale_ub is not None: 2025-05-07T20:32:05.8613671Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8613805Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8613886Z ) 2025-05-07T20:32:05.8613963Z else: 2025-05-07T20:32:05.8614055Z scale_ub_tensor = None 2025-05-07T20:32:05.8614134Z 2025-05-07T20:32:05.8614264Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8614356Z op = silu_mul_quant 2025-05-07T20:32:05.8614441Z if compiled: 2025-05-07T20:32:05.8614583Z op = torch.compile(op) 2025-05-07T20:32:05.8614695Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8614765Z 2025-05-07T20:32:05.8614854Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8614859Z 2025-05-07T20:32:05.8614959Z moe/activation_test.py:117: 2025-05-07T20:32:05.8615088Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8615185Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8615335Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8615834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8615934Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8616290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8616510Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8616855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8616950Z kernel = self.compile( 2025-05-07T20:32:05.8617329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8617507Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8617699Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8617704Z 2025-05-07T20:32:05.8617912Z self = 2025-05-07T20:32:05.8618687Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8619198Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a89237e0>} 2025-05-07T20:32:05.8619949Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8620141Z context = 2025-05-07T20:32:05.8620150Z 2025-05-07T20:32:05.8620318Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8620579Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8620687Z module_map=module_map) 2025-05-07T20:32:05.8620847Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8620945Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8621066Z E ^ 2025-05-07T20:32:05.8621423Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8621428Z 2025-05-07T20:32:05.8621839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8621849Z 2025-05-07T20:32:05.8621952Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8622173Z self=, 2025-05-07T20:32:05.8622259Z T=2048, 2025-05-07T20:32:05.8622335Z D=7168, 2025-05-07T20:32:05.8622414Z scale_ub=None, 2025-05-07T20:32:05.8622503Z contiguous=False, 2025-05-07T20:32:05.8622586Z compiled=False, 2025-05-07T20:32:05.8622657Z ) 2025-05-07T20:32:05.8622875Z self = 2025-05-07T20:32:05.8623046Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:05.8623053Z 2025-05-07T20:32:05.8623166Z @given( 2025-05-07T20:32:05.8623293Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8623391Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8623506Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8623618Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8623728Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8623852Z ) 2025-05-07T20:32:05.8624094Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8624185Z def test_silu_mul_quant( 2025-05-07T20:32:05.8624260Z self, 2025-05-07T20:32:05.8624339Z T: int, 2025-05-07T20:32:05.8624416Z D: int, 2025-05-07T20:32:05.8624515Z scale_ub: Optional[float], 2025-05-07T20:32:05.8624601Z contiguous: bool, 2025-05-07T20:32:05.8624690Z compiled: bool, 2025-05-07T20:32:05.8624766Z ) -> None: 2025-05-07T20:32:05.8624863Z torch.manual_seed(2025) 2025-05-07T20:32:05.8624943Z 2025-05-07T20:32:05.8625110Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8626896Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8626963Z 2025-05-07T20:32:05.8627097Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8627102Z 2025-05-07T20:32:05.8627209Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8627435Z self=, 2025-05-07T20:32:05.8627511Z T=128, 2025-05-07T20:32:05.8627584Z D=7168, 2025-05-07T20:32:05.8627672Z scale_ub=1200.0, 2025-05-07T20:32:05.8627756Z contiguous=True, 2025-05-07T20:32:05.8627842Z compiled=True, 2025-05-07T20:32:05.8627916Z ) 2025-05-07T20:32:05.8628129Z self = 2025-05-07T20:32:05.8628299Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.8628308Z 2025-05-07T20:32:05.8628384Z @given( 2025-05-07T20:32:05.8628500Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8628603Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8628715Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8628826Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8628943Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8629059Z ) 2025-05-07T20:32:05.8629305Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8629396Z def test_silu_mul_quant( 2025-05-07T20:32:05.8629473Z self, 2025-05-07T20:32:05.8629555Z T: int, 2025-05-07T20:32:05.8629630Z D: int, 2025-05-07T20:32:05.8629728Z scale_ub: Optional[float], 2025-05-07T20:32:05.8629820Z contiguous: bool, 2025-05-07T20:32:05.8629904Z compiled: bool, 2025-05-07T20:32:05.8629986Z ) -> None: 2025-05-07T20:32:05.8630083Z torch.manual_seed(2025) 2025-05-07T20:32:05.8630155Z 2025-05-07T20:32:05.8630319Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8630397Z 2025-05-07T20:32:05.8630485Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8630612Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8630697Z x = x_sign * x_clamp 2025-05-07T20:32:05.8630778Z x0 = x[:, :D] 2025-05-07T20:32:05.8630904Z x1 = x[:, D:] 2025-05-07T20:32:05.8630977Z 2025-05-07T20:32:05.8631060Z if contiguous: 2025-05-07T20:32:05.8631158Z x0 = x0.contiguous() 2025-05-07T20:32:05.8631245Z x1 = x1.contiguous() 2025-05-07T20:32:05.8631316Z 2025-05-07T20:32:05.8631412Z if scale_ub is not None: 2025-05-07T20:32:05.8631516Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8631649Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8631771Z ) 2025-05-07T20:32:05.8631845Z else: 2025-05-07T20:32:05.8631937Z scale_ub_tensor = None 2025-05-07T20:32:05.8632015Z 2025-05-07T20:32:05.8632141Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8632235Z op = silu_mul_quant 2025-05-07T20:32:05.8632317Z if compiled: 2025-05-07T20:32:05.8632414Z op = torch.compile(op) 2025-05-07T20:32:05.8632526Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8632603Z 2025-05-07T20:32:05.8632692Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8632697Z 2025-05-07T20:32:05.8632798Z moe/activation_test.py:117: 2025-05-07T20:32:05.8632925Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8633022Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8633124Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8633490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.8633630Z return fn(*args, **kwargs) 
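[editor's note] Every OOM record above ends with the same mitigation hint. A minimal sketch of applying it, assuming the test process controls its own environment; `PYTORCH_CUDA_ALLOC_CONF` and `expandable_segments:True` are taken from the log, the rest is illustrative:

```python
import os

# The allocator reads PYTORCH_CUDA_ALLOC_CONF once, at first CUDA use, so it
# must be set before torch initializes CUDA (easiest: before importing torch).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported after the env var so the caching allocator sees it

if torch.cuda.is_available():
    # Same shape as the failing examples above (T=16384, D=7168, bfloat16).
    x = torch.randn([16384, 2 * 7168], device="cuda", dtype=torch.bfloat16)
```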
2025-05-07T20:32:05.8634121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8634218Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8634579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8634803Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8635140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8635236Z kernel = self.compile( 2025-05-07T20:32:05.8635614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8635792Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8635921Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8635925Z 2025-05-07T20:32:05.8636129Z self = 2025-05-07T20:32:05.8637021Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8637525Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a8786a20>} 2025-05-07T20:32:05.8638279Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8638474Z context = 2025-05-07T20:32:05.8638478Z 2025-05-07T20:32:05.8638645Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8638905Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8639010Z module_map=module_map) 2025-05-07T20:32:05.8639177Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8639278Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8639394Z E ^ 2025-05-07T20:32:05.8639752Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8639757Z 2025-05-07T20:32:05.8640167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8640171Z 2025-05-07T20:32:05.8640278Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8640538Z self=, 2025-05-07T20:32:05.8640616Z T=128, 2025-05-07T20:32:05.8640695Z D=7168, 2025-05-07T20:32:05.8640776Z scale_ub=1200.0, 2025-05-07T20:32:05.8640858Z contiguous=True, 2025-05-07T20:32:05.8640946Z compiled=False, 2025-05-07T20:32:05.8641017Z ) 2025-05-07T20:32:05.8641232Z self = 2025-05-07T20:32:05.8641409Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.8641414Z 2025-05-07T20:32:05.8641490Z @given( 2025-05-07T20:32:05.8641607Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8641707Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8641820Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8641940Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8642051Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8642170Z ) 2025-05-07T20:32:05.8642415Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8642507Z def test_silu_mul_quant( 2025-05-07T20:32:05.8642592Z self, 2025-05-07T20:32:05.8642666Z T: int, 2025-05-07T20:32:05.8642744Z D: int, 2025-05-07T20:32:05.8642844Z scale_ub: Optional[float], 2025-05-07T20:32:05.8642933Z contiguous: bool, 2025-05-07T20:32:05.8643020Z compiled: bool, 2025-05-07T20:32:05.8643106Z ) -> None: 2025-05-07T20:32:05.8643198Z torch.manual_seed(2025) 2025-05-07T20:32:05.8643270Z 2025-05-07T20:32:05.8643437Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8643510Z 2025-05-07T20:32:05.8643598Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8643724Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8645531Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8645542Z 2025-05-07T20:32:05.8645669Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:05.8645673Z 2025-05-07T20:32:05.8645776Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8646003Z self=, 2025-05-07T20:32:05.8646080Z T=128, 2025-05-07T20:32:05.8646158Z D=5120, 2025-05-07T20:32:05.8646244Z scale_ub=1200.0, 2025-05-07T20:32:05.8646328Z contiguous=True, 2025-05-07T20:32:05.8646411Z compiled=True, 2025-05-07T20:32:05.8646485Z ) 2025-05-07T20:32:05.8646703Z self = 2025-05-07T20:32:05.8646885Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.8646890Z 2025-05-07T20:32:05.8646977Z @given( 2025-05-07T20:32:05.8647117Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8647218Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8647375Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8647491Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8647660Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8647729Z ) 2025-05-07T20:32:05.8647969Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8648066Z def test_silu_mul_quant( 2025-05-07T20:32:05.8648140Z self, 2025-05-07T20:32:05.8648286Z T: int, 2025-05-07T20:32:05.8648363Z D: int, 2025-05-07T20:32:05.8648460Z scale_ub: Optional[float], 2025-05-07T20:32:05.8648547Z contiguous: bool, 2025-05-07T20:32:05.8648627Z compiled: bool, 2025-05-07T20:32:05.8648703Z ) -> None: 2025-05-07T20:32:05.8648804Z torch.manual_seed(2025) 2025-05-07T20:32:05.8648878Z 2025-05-07T20:32:05.8649043Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8649123Z 2025-05-07T20:32:05.8649221Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8649346Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8651109Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8651160Z 2025-05-07T20:32:05.8651277Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:05.8651281Z 2025-05-07T20:32:05.8651389Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8651608Z self=, 2025-05-07T20:32:05.8651693Z T=128, 2025-05-07T20:32:05.8651769Z D=7168, 2025-05-07T20:32:05.8651850Z scale_ub=None, 2025-05-07T20:32:05.8651937Z contiguous=True, 2025-05-07T20:32:05.8652016Z compiled=True, 2025-05-07T20:32:05.8652084Z ) 2025-05-07T20:32:05.8652303Z self = 2025-05-07T20:32:05.8652466Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.8652476Z 2025-05-07T20:32:05.8652552Z @given( 2025-05-07T20:32:05.8652670Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8652768Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8652880Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8653000Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8653109Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8653188Z ) 2025-05-07T20:32:05.8653474Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8653567Z def test_silu_mul_quant( 2025-05-07T20:32:05.8653645Z self, 2025-05-07T20:32:05.8653719Z T: int, 2025-05-07T20:32:05.8653791Z D: int, 2025-05-07T20:32:05.8653890Z scale_ub: Optional[float], 2025-05-07T20:32:05.8653977Z contiguous: bool, 2025-05-07T20:32:05.8654061Z compiled: bool, 2025-05-07T20:32:05.8654140Z ) -> None: 2025-05-07T20:32:05.8654238Z torch.manual_seed(2025) 2025-05-07T20:32:05.8654310Z 2025-05-07T20:32:05.8654479Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8656285Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8656295Z 2025-05-07T20:32:05.8656410Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8656543Z =============================== warnings summary =============================== 2025-05-07T20:32:05.8656890Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:05.8657187Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:05.8657481Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:05.8658362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:05.8658593Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:05.8658597Z 2025-05-07T20:32:05.8658816Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:05.8658983Z ================= 1 failed, 1 deselected, 3 warnings in 16.33s ================= 2025-05-07T20:32:07.4783557Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:07.5403056Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:07.5403348Z 2025-05-07T20:32:09.5420030Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:11.6823013Z ============================= test session starts ============================== 2025-05-07T20:32:11.6823686Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:11.6824209Z cachedir: .pytest_cache 2025-05-07T20:32:11.6824778Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:11.6825539Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:11.6825950Z plugins: hypothesis-6.131.14 2025-05-07T20:32:13.2679182Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:13.4196153Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:13.4197274Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:13.4197848Z 2025-05-07T20:32:15.7672315Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.7673595Z self=, 2025-05-07T20:32:15.7674418Z T=1, 2025-05-07T20:32:15.7674798Z D=5120, 2025-05-07T20:32:15.7675179Z scale_ub=None, 2025-05-07T20:32:15.7675612Z contiguous=True, 2025-05-07T20:32:15.7676061Z compiled=True, 2025-05-07T20:32:15.7676464Z ) 2025-05-07T20:32:15.7677127Z self = 2025-05-07T20:32:15.7678102Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:15.7678619Z 2025-05-07T20:32:15.7678794Z @given( 2025-05-07T20:32:15.7679252Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.7679752Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.7680067Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.7680401Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.7680825Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.7681122Z ) 2025-05-07T20:32:15.7681472Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.7681927Z def test_silu_mul_quant( 2025-05-07T20:32:15.7682176Z self, 2025-05-07T20:32:15.7682372Z T: int, 2025-05-07T20:32:15.7682577Z D: int, 2025-05-07T20:32:15.7682805Z scale_ub: Optional[float], 2025-05-07T20:32:15.7683166Z contiguous: bool, 2025-05-07T20:32:15.7683410Z compiled: bool, 2025-05-07T20:32:15.7683649Z ) -> None: 2025-05-07T20:32:15.7683876Z torch.manual_seed(2025) 2025-05-07T20:32:15.7684119Z 2025-05-07T20:32:15.7684420Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.7684765Z 2025-05-07T20:32:15.7684969Z x_sign = torch.sign(x) 2025-05-07T20:32:15.7685264Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:15.7685584Z x = x_sign * x_clamp 2025-05-07T20:32:15.7685834Z x0 = x[:, :D] 2025-05-07T20:32:15.7686047Z x1 = x[:, D:] 2025-05-07T20:32:15.7686269Z 2025-05-07T20:32:15.7686463Z if contiguous: 2025-05-07T20:32:15.7686699Z x0 = x0.contiguous() 2025-05-07T20:32:15.7686966Z x1 = x1.contiguous() 2025-05-07T20:32:15.7687212Z 2025-05-07T20:32:15.7687405Z if scale_ub is not None: 2025-05-07T20:32:15.7687901Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.7688245Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.7688556Z ) 2025-05-07T20:32:15.7688761Z else: 2025-05-07T20:32:15.7688981Z scale_ub_tensor = None 2025-05-07T20:32:15.7689247Z 2025-05-07T20:32:15.7689485Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.7689814Z op = silu_mul_quant 2025-05-07T20:32:15.7690075Z if compiled: 2025-05-07T20:32:15.7690329Z op = torch.compile(op) 2025-05-07T20:32:15.7690633Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.7690919Z 2025-05-07T20:32:15.7691117Z y_fp8, y_scale = fn() 2025-05-07T20:32:15.7691411Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:15.7691712Z 2025-05-07T20:32:15.7691954Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.7692299Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:15.7692610Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:15.7692924Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:15.7693290Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:15.7693617Z 2025-05-07T20:32:15.7693832Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:15.7694031Z 2025-05-07T20:32:15.7694135Z moe/activation_test.py:126: 2025-05-07T20:32:15.7694496Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.7694845Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:15.7695175Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:15.7695972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:15.7696736Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:15.7697290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.7697980Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.7698669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:15.7699389Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:15.7700236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:15.7700988Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:15.7701720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:15.7702356Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:15.7703007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:15.7703526Z fn() 2025-05-07T20:32:15.7704030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:15.7704617Z self.fn.run( 
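[editor's note] The traceback above passes through `triton/runtime/autotuner.py` (`run` -> `_bench` -> `do_bench`), the same module that emitted the earlier DeprecationWarning about `warmup`, `rep`, and `use_cuda_graph`. A hedged sketch of an autotuned Triton kernel declared without those deprecated parameters, per the linked triton-lang/triton#4496; the kernel itself is a toy copy kernel, not FBGEMM code:

```python
import triton
import triton.language as tl

@triton.autotune(
    # Benchmarking knobs now live in do_bench, so only configs/key are passed.
    configs=[triton.Config({"BLOCK": 128}), triton.Config({"BLOCK": 256})],
    key=["n_elements"],
)
@triton.jit
def _copy_kernel(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    tl.store(y_ptr + offs, tl.load(x_ptr + offs, mask=mask), mask=mask)
```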
2025-05-07T20:32:15.7705087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.7705953Z kernel = self.compile( 2025-05-07T20:32:15.7706524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.7707179Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.7707575Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.7707804Z 2025-05-07T20:32:15.7708009Z self = 2025-05-07T20:32:15.7709190Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.7710583Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05c48553a0>} 2025-05-07T20:32:15.7711929Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.7712962Z context = 2025-05-07T20:32:15.7713248Z 2025-05-07T20:32:15.7713415Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.7713942Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.7714417Z module_map=module_map) 2025-05-07T20:32:15.7714791Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.7715148Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:15.7715420Z E ^ 2025-05-07T20:32:15.7715888Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.7716337Z 2025-05-07T20:32:15.7716824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.7717347Z 2025-05-07T20:32:15.7717454Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.7717871Z self=, 2025-05-07T20:32:15.7718277Z T=2048, 2025-05-07T20:32:15.7718467Z D=5120, 2025-05-07T20:32:15.7718666Z scale_ub=1200.0, 2025-05-07T20:32:15.7718897Z contiguous=True, 2025-05-07T20:32:15.7719127Z compiled=False, 2025-05-07T20:32:15.7719337Z ) 2025-05-07T20:32:16.6881882Z self = 2025-05-07T20:32:16.6882494Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.6882880Z 2025-05-07T20:32:16.6882987Z @given( 2025-05-07T20:32:16.6883225Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6883542Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6884169Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6884512Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6884848Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6885148Z ) 2025-05-07T20:32:16.6885507Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6885950Z def test_silu_mul_quant( 2025-05-07T20:32:16.6886199Z self, 2025-05-07T20:32:16.6886501Z T: int, 2025-05-07T20:32:16.6886703Z D: int, 2025-05-07T20:32:16.6886925Z scale_ub: Optional[float], 2025-05-07T20:32:16.6887206Z contiguous: bool, 2025-05-07T20:32:16.6887446Z compiled: bool, 2025-05-07T20:32:16.6887810Z ) -> None: 2025-05-07T20:32:16.6888042Z torch.manual_seed(2025) 2025-05-07T20:32:16.6888292Z 2025-05-07T20:32:16.6888579Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6888939Z 
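[editor's note] The recurring `ValueError("type fp8e4nv not supported in this architecture...")` is consistent with the runner hardware: linux.g5.4xlarge carries an A10G (compute capability 8.6), while Triton's fp8e4nv (float8_e4m3fn) lowering is assumed here to require capability 8.9+ (Ada/Hopper). A sketch of a guard that would skip these tests instead of failing inside the Triton compiler; the helper names are hypothetical:

```python
import unittest
import torch

def _has_fp8e4nv() -> bool:
    # Assumption: fp8e4nv needs SM 8.9 or newer; A10G (8.6) fails as above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

skip_unless_fp8 = unittest.skipUnless(
    _has_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9 (Ada/Hopper)"
)
```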
2025-05-07T20:32:16.6889148Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6889453Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6889777Z x = x_sign * x_clamp 2025-05-07T20:32:16.6890025Z x0 = x[:, :D] 2025-05-07T20:32:16.6890258Z x1 = x[:, D:] 2025-05-07T20:32:16.6890482Z 2025-05-07T20:32:16.6890685Z if contiguous: 2025-05-07T20:32:16.6890927Z x0 = x0.contiguous() 2025-05-07T20:32:16.6891207Z x1 = x1.contiguous() 2025-05-07T20:32:16.6891566Z 2025-05-07T20:32:16.6891762Z if scale_ub is not None: 2025-05-07T20:32:16.6892048Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6892398Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6892709Z ) 2025-05-07T20:32:16.6892912Z else: 2025-05-07T20:32:16.6893133Z scale_ub_tensor = None 2025-05-07T20:32:16.6893389Z 2025-05-07T20:32:16.6893636Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6893962Z op = silu_mul_quant 2025-05-07T20:32:16.6894220Z if compiled: 2025-05-07T20:32:16.6894485Z op = torch.compile(op) 2025-05-07T20:32:16.6894796Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6895076Z 2025-05-07T20:32:16.6895294Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6895469Z 2025-05-07T20:32:16.6895574Z moe/activation_test.py:117: 2025-05-07T20:32:16.6895887Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6896243Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6896531Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6897236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6897947Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6898569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6899270Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6899942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6900487Z kernel = self.compile( 2025-05-07T20:32:16.6901036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6901708Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6902117Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6902348Z 2025-05-07T20:32:16.6902567Z self = 2025-05-07T20:32:16.6903699Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6905100Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f05c45102c0>} 2025-05-07T20:32:16.6906835Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6907950Z context = 2025-05-07T20:32:16.6908237Z 2025-05-07T20:32:16.6908405Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6908927Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6909395Z module_map=module_map) 2025-05-07T20:32:16.6909775Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6910174Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6910438Z E ^ 2025-05-07T20:32:16.6910903Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6911350Z 2025-05-07T20:32:16.6911766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6912400Z 2025-05-07T20:32:16.6912506Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6912925Z self=, 2025-05-07T20:32:16.6913333Z T=2048, 2025-05-07T20:32:16.6913527Z D=5120, 2025-05-07T20:32:16.6913735Z scale_ub=1200.0, 2025-05-07T20:32:16.6913973Z contiguous=True, 2025-05-07T20:32:16.6914202Z compiled=True, 2025-05-07T20:32:16.6914422Z ) 2025-05-07T20:32:16.6914754Z self = 2025-05-07T20:32:16.6915258Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.6915541Z 2025-05-07T20:32:16.6915624Z @given( 2025-05-07T20:32:16.6915871Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6916198Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6916510Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6916855Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6917199Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6917490Z ) 2025-05-07T20:32:16.6917853Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6918303Z def test_silu_mul_quant( 2025-05-07T20:32:16.6918551Z self, 2025-05-07T20:32:16.6918760Z T: int, 2025-05-07T20:32:16.6918973Z D: int, 2025-05-07T20:32:16.6919198Z scale_ub: Optional[float], 2025-05-07T20:32:16.6919553Z contiguous: bool, 2025-05-07T20:32:16.6919814Z compiled: bool, 2025-05-07T20:32:16.6920043Z ) -> None: 2025-05-07T20:32:16.6920274Z torch.manual_seed(2025) 2025-05-07T20:32:16.6920529Z 2025-05-07T20:32:16.6920817Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6921165Z 2025-05-07T20:32:16.6921376Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6921684Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6922003Z x = x_sign * x_clamp 2025-05-07T20:32:16.6922258Z x0 = x[:, :D] 2025-05-07T20:32:16.6922491Z x1 = x[:, D:] 2025-05-07T20:32:16.6922704Z 2025-05-07T20:32:16.6922906Z if contiguous: 2025-05-07T20:32:16.6923151Z x0 = x0.contiguous() 2025-05-07T20:32:16.6923420Z x1 = x1.contiguous() 2025-05-07T20:32:16.6923670Z 2025-05-07T20:32:16.6923878Z if scale_ub is not None: 2025-05-07T20:32:16.6924155Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6924577Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6924901Z ) 2025-05-07T20:32:16.6925102Z else: 2025-05-07T20:32:16.6925325Z scale_ub_tensor = None 2025-05-07T20:32:16.6925590Z 2025-05-07T20:32:16.6925834Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6926158Z op = silu_mul_quant 2025-05-07T20:32:16.6926424Z if compiled: 
2025-05-07T20:32:16.6926733Z op = torch.compile(op) 2025-05-07T20:32:16.6927037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6927325Z 2025-05-07T20:32:16.6927613Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.6927895Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.6928187Z 2025-05-07T20:32:16.6928426Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6928752Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.6929048Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.6929366Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.6929717Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.6930027Z 2025-05-07T20:32:16.6930226Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:16.6930419Z 2025-05-07T20:32:16.6930522Z moe/activation_test.py:126: 2025-05-07T20:32:16.6930813Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6931201Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.6931528Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.6932310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.6933069Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.6933623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6934316Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6935002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.6935728Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.6936487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:16.6937244Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.6937973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.6938629Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.6939287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.6939811Z fn() 2025-05-07T20:32:16.6940331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.6940922Z self.fn.run( 2025-05-07T20:32:16.6941399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6941934Z kernel = self.compile( 2025-05-07T20:32:16.6942490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6943153Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6943553Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6943791Z 2025-05-07T20:32:16.6944003Z self = 2025-05-07T20:32:16.6945135Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:16.6946509Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05c4510900>} 2025-05-07T20:32:16.6947852Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6948917Z context = 2025-05-07T20:32:16.6949212Z 2025-05-07T20:32:16.6949378Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6949916Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6950441Z module_map=module_map) 2025-05-07T20:32:16.6950805Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6951166Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.6951439Z E ^ 2025-05-07T20:32:16.6951902Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6952357Z 2025-05-07T20:32:16.6952772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6953339Z 2025-05-07T20:32:16.6953447Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6953866Z self=, 2025-05-07T20:32:16.6954270Z T=16384, 2025-05-07T20:32:16.6954473Z D=7168, 2025-05-07T20:32:16.6954678Z scale_ub=1200.0, 2025-05-07T20:32:16.6954904Z contiguous=False, 2025-05-07T20:32:16.6955144Z compiled=False, 2025-05-07T20:32:16.6955357Z ) 2025-05-07T20:32:17.4852970Z self = 2025-05-07T20:32:17.4853743Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.4854138Z 2025-05-07T20:32:17.4854259Z @given( 2025-05-07T20:32:17.4854574Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4854982Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.4855381Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.4855706Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.4856043Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.4856332Z ) 2025-05-07T20:32:17.4856682Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.4857125Z def test_silu_mul_quant( 2025-05-07T20:32:17.4857371Z self, 2025-05-07T20:32:17.4857564Z T: int, 2025-05-07T20:32:17.4858075Z D: int, 2025-05-07T20:32:17.4858304Z scale_ub: Optional[float], 2025-05-07T20:32:17.4858572Z contiguous: bool, 2025-05-07T20:32:17.4858815Z compiled: bool, 2025-05-07T20:32:17.4859046Z ) -> None: 2025-05-07T20:32:17.4859268Z torch.manual_seed(2025) 2025-05-07T20:32:17.4859504Z 2025-05-07T20:32:17.4859780Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4860129Z 2025-05-07T20:32:17.4860359Z x_sign = torch.sign(x) 2025-05-07T20:32:17.4860642Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.4860954Z x = x_sign * x_clamp 2025-05-07T20:32:17.4861201Z x0 = x[:, :D] 2025-05-07T20:32:17.4861416Z x1 = x[:, D:] 2025-05-07T20:32:17.4861621Z 2025-05-07T20:32:17.4861809Z if contiguous: 2025-05-07T20:32:17.4862044Z x0 = x0.contiguous() 2025-05-07T20:32:17.4862296Z x1 = x1.contiguous() 2025-05-07T20:32:17.4862536Z 2025-05-07T20:32:17.4862814Z if scale_ub is not None: 2025-05-07T20:32:17.4863085Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.4863423Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.4863735Z ) 2025-05-07T20:32:17.4863922Z else: 2025-05-07T20:32:17.4864133Z scale_ub_tensor = None 2025-05-07T20:32:17.4864670Z 2025-05-07T20:32:17.4864899Z def fn() -> Tuple[torch.Tensor, 
torch.Tensor]: 2025-05-07T20:32:17.4865303Z op = silu_mul_quant 2025-05-07T20:32:17.4865550Z if compiled: 2025-05-07T20:32:17.4865790Z op = torch.compile(op) 2025-05-07T20:32:17.4866086Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4866359Z 2025-05-07T20:32:17.4866551Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.4866718Z 2025-05-07T20:32:17.4866818Z moe/activation_test.py:117: 2025-05-07T20:32:17.4867115Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4867477Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.4867754Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4868454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.4869153Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.4869691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.4870463Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4871130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.4871661Z kernel = self.compile( 2025-05-07T20:32:17.4879373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.4880059Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4880471Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4880703Z 2025-05-07T20:32:17.4880914Z self = 2025-05-07T20:32:17.4882000Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4883391Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05bf51bc40>} 2025-05-07T20:32:17.4884730Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4885825Z context = 2025-05-07T20:32:17.4886126Z 2025-05-07T20:32:17.4886294Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4886818Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4887289Z module_map=module_map) 2025-05-07T20:32:17.4887748Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4888115Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.4888386Z E ^ 2025-05-07T20:32:17.4888851Z E ValueError("type fp8e4nv not supported in this architecture. 
[log condensed: Hypothesis then tried ten more examples. For each one it re-printed the test_silu_mul_quant source listing shown above and an essentially identical traceback (differing only in timestamps, object addresses, and, on the eager reference path, extra triton/runtime/autotuner.py frames), always ending in the same error:
    ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Examples with compiled=False failed directly at y_fp8, y_scale = fn() (moe/activation_test.py:117) while compiling _fbgemm_silu_mul_quant; examples with compiled=True got past fn() and instead failed at y_fp8_ref, y_scale_ref = ref_fn() (moe/activation_test.py:126) while compiling _kernel_quantize_fp8_row via triton_quantize_fp8_row (fp8_gemm.py:2370). The tried examples were:
    T=1,    D=7168, scale_ub=None,   contiguous=True,  compiled=True   -> ref_fn(): _kernel_quantize_fp8_row
    T=4096, D=5120, scale_ub=None,   contiguous=False, compiled=False  -> fn(): _fbgemm_silu_mul_quant
    T=4096, D=7168, scale_ub=None,   contiguous=False, compiled=False  -> fn(): _fbgemm_silu_mul_quant
    T=128,  D=7168, scale_ub=None,   contiguous=False, compiled=True   -> ref_fn(): _kernel_quantize_fp8_row
    T=128,  D=7168, scale_ub=None,   contiguous=False, compiled=False  -> fn(): _fbgemm_silu_mul_quant
    T=4096, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False  -> fn(): _fbgemm_silu_mul_quant
    T=1,    D=5120, scale_ub=None,   contiguous=True,  compiled=True   -> ref_fn(): _kernel_quantize_fp8_row
    T=2048, D=5120, scale_ub=None,   contiguous=True,  compiled=True   -> ref_fn(): _kernel_quantize_fp8_row
    T=128,  D=5120, scale_ub=None,   contiguous=True,  compiled=True   -> ref_fn(): _kernel_quantize_fp8_row
    T=4096, D=5120, scale_ub=None,   contiguous=True,  compiled=True   -> ref_fn(): _kernel_quantize_fp8_row]
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:20.7423293Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05beb02840>} 2025-05-07T20:32:20.7424639Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:20.7425676Z context = 2025-05-07T20:32:20.7425965Z 2025-05-07T20:32:20.7426138Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:20.7426657Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:20.7427130Z module_map=module_map) 2025-05-07T20:32:20.7427499Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:20.7427937Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:20.7428205Z E ^ 2025-05-07T20:32:20.7428677Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:20.7429127Z 2025-05-07T20:32:20.7429547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:20.7430057Z 2025-05-07T20:32:20.7430166Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:20.7430587Z self=, 2025-05-07T20:32:20.7430998Z T=16384, 2025-05-07T20:32:20.7431229Z D=5120, 2025-05-07T20:32:20.7431449Z scale_ub=None, 2025-05-07T20:32:20.7431673Z contiguous=True, 2025-05-07T20:32:20.7431904Z compiled=True, 2025-05-07T20:32:20.7432113Z ) 2025-05-07T20:32:20.7691615Z W0507 20:32:20.768000 87964 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:20.7693290Z W0507 20:32:20.768000 87964 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:20.7694653Z W0507 20:32:20.768000 87964 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:20.7695877Z W0507 20:32:20.768000 87964 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:20.7696989Z W0507 20:32:20.768000 87964 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 
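Note on the W0507 warnings above: they are a side effect of the parameter sweep, not of the compilation failures that follow. Hypothesis alternates contiguous between True and False, so x0 switches between a compacted copy (row stride 5120) and a strided view of x (row stride 10240); each stride change invalidates torch._dynamo's guards on silu_mul_quant until the default recompile limit of 8 is hit. A minimal sketch of the knobs the warning itself points to, assuming this PyTorch build exposes the config.recompile_limit attribute the warning names:

    import torch._dynamo

    # The warning reports config.recompile_limit (8); raising it lets
    # torch.compile keep specializing on new input strides instead of
    # giving up on the frame after eight recompiles.
    torch._dynamo.config.recompile_limit = 32

Alternatively, running with TORCH_LOGS="recompiles" (as the warning suggests) logs every recompilation reason without changing behavior.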
[same test body and traceback as the T=4096 example above: ref_fn() at moe/activation_test.py:126 -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid]]
2025-05-07T20:32:20.8423286Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:20.8423655Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:20.8423925Z E   ^
2025-05-07T20:32:20.8424396Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.8425971Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
[same test body; fails earlier, at fn() (moe/activation_test.py:117) -> torch.compile(silu_mul_quant) -> _fbgemm_silu_mul_quant[grid] (activation.py:80)]
2025-05-07T20:32:21.1172430Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:21.1172793Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:21.1173069Z E   ^
2025-05-07T20:32:21.1173554Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:21.1175108Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
[same test body; fails at ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]]
2025-05-07T20:32:21.1681889Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:21.1682252Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:21.1682520Z E   ^
2025-05-07T20:32:21.1682990Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:21.1684540Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
[same test body; fails at fn() -> silu_mul_quant, uncompiled -> _fbgemm_silu_mul_quant[grid]]
2025-05-07T20:32:21.2866310Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:21.2866667Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:21.2866927Z E   ^
2025-05-07T20:32:21.2867396Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
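Both failure sites attempt the same work: the path under test (silu_mul_quant) fuses SiLU, elementwise multiply, and row-wise FP8 quantization into one Triton kernel (_fbgemm_silu_mul_quant), while the reference path recomputes the activation in fp32 and quantizes it with triton_quantize_fp8_row (_kernel_quantize_fp8_row). For orientation, the row-wise quantization both kernels perform amounts to roughly the following; this is a rough pure-PyTorch sketch, not FBGEMM's implementation, with 448.0 being the largest finite fp8 e4m3 value:

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row scale chosen so each row's max magnitude maps onto the
        # largest finite fp8e4m3 value (448); scale_ub optionally caps it.
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / 448.0
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        # Dequantize as the test does: y_fp8.to(torch.float32) * scale[:, None]
        return y_fp8, scale

Whether the returned scale is the multiplier or its reciprocal is a convention; the test's check y_fp8.to(torch.float32) * y_scale[:, None] fixes it as the multiplier here.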
2025-05-07T20:32:21.2868945Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
[same test body; fails at fn() -> torch.compile(silu_mul_quant) -> _fbgemm_silu_mul_quant[grid]]
2025-05-07T20:32:21.2898567Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:21.2898936Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:21.2899198Z E   ^
2025-05-07T20:32:21.2899665Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:21.2901176Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[same test body; fails at fn() -> silu_mul_quant, uncompiled -> _fbgemm_silu_mul_quant[grid]]
2025-05-07T20:32:21.3797214Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:21.3797612Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:21.3797883Z E   ^
2025-05-07T20:32:21.3798351Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
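Every remaining example fails identically; the inputs never matter because the error fires at kernel compile time, before any data is touched. Any Triton kernel that materializes an fp8e4nv value reproduces it. A minimal standalone repro sketch (hypothetical, not part of the test suite; assumes triton and a CUDA device are available):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr):
        # Storing a loaded fp32 value as fp8e4nv forces the compiler to
        # lower an fp8e4nv conversion, the step that pre-SM-8.9 GPUs reject.
        x = tl.load(x_ptr)
        tl.store(y_ptr, x.to(tl.float8e4nv))

    x = torch.ones(1, device="cuda", dtype=torch.float32)
    y = torch.empty(1, device="cuda", dtype=torch.float8_e4m3fn)
    # On SM 8.9+ this stores 1.0 as fp8; on older GPUs (e.g. A10G, SM 8.6)
    # it raises the same CompilationError seen throughout this log.
    _cast_to_fp8e4nv[(1,)](x, y)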
2025-05-07T20:32:21.3799839Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
[same test body; fails at fn() -> silu_mul_quant, uncompiled -> _fbgemm_silu_mul_quant[grid]]
2025-05-07T20:32:21.3837921Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:21.3838284Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:21.3838550Z E   ^
2025-05-07T20:32:21.3839078Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:21.3840589Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[same test body; fails at fn() -> silu_mul_quant, uncompiled -> _fbgemm_silu_mul_quant[grid]]
2025-05-07T20:32:21.6792279Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:21.6792638Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:21.6792899Z E   ^
2025-05-07T20:32:21.6793369Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:21.6794907Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[same test body; fails at fn() -> torch.compile(silu_mul_quant) -> _fbgemm_silu_mul_quant[grid]]
2025-05-07T20:32:21.6824884Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:21.6825229Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:21.6825493Z E   ^
2025-05-07T20:32:21.6825954Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:21.6827481Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body; fails at fn() -> torch.compile(silu_mul_quant) -> _fbgemm_silu_mul_quant[grid]]
2025-05-07T20:32:21.7859892Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:21.7860245Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:21.7860504Z E   ^
2025-05-07T20:32:21.7860972Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.7861424Z 2025-05-07T20:32:21.7861847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.7862357Z 2025-05-07T20:32:21.7862463Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.7862878Z self=, 2025-05-07T20:32:21.7863282Z T=1, 2025-05-07T20:32:21.7863471Z D=7168, 2025-05-07T20:32:21.7863667Z scale_ub=None, 2025-05-07T20:32:21.7863890Z contiguous=False, 2025-05-07T20:32:21.7864169Z compiled=True, 2025-05-07T20:32:21.7864382Z ) 2025-05-07T20:32:21.8538896Z self = 2025-05-07T20:32:21.8539526Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:21.8539789Z 2025-05-07T20:32:21.8539878Z @given( 2025-05-07T20:32:21.8540100Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.8540412Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.8540743Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.8541071Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.8541390Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.8541675Z ) 2025-05-07T20:32:21.8542020Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.8542453Z def test_silu_mul_quant( 2025-05-07T20:32:21.8542701Z self, 2025-05-07T20:32:21.8542897Z T: int, 2025-05-07T20:32:21.8543320Z D: int, 2025-05-07T20:32:21.8543541Z scale_ub: Optional[float], 2025-05-07T20:32:21.8543812Z contiguous: bool, 2025-05-07T20:32:21.8544045Z compiled: bool, 2025-05-07T20:32:21.8544273Z ) -> None: 2025-05-07T20:32:21.8544488Z torch.manual_seed(2025) 2025-05-07T20:32:21.8544726Z 2025-05-07T20:32:21.8544998Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.8545427Z 2025-05-07T20:32:21.8545616Z x_sign = torch.sign(x) 2025-05-07T20:32:21.8545907Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.8546221Z x = x_sign * x_clamp 2025-05-07T20:32:21.8546462Z x0 = x[:, :D] 2025-05-07T20:32:21.8546671Z x1 = x[:, D:] 2025-05-07T20:32:21.8546886Z 2025-05-07T20:32:21.8547070Z if contiguous: 2025-05-07T20:32:21.8547297Z x0 = x0.contiguous() 2025-05-07T20:32:21.8547550Z x1 = x1.contiguous() 2025-05-07T20:32:21.8547810Z 2025-05-07T20:32:21.8547998Z if scale_ub is not None: 2025-05-07T20:32:21.8548272Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.8548604Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.8557094Z ) 2025-05-07T20:32:21.8557308Z else: 2025-05-07T20:32:21.8557526Z scale_ub_tensor = None 2025-05-07T20:32:21.8557800Z 2025-05-07T20:32:21.8558052Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.8558536Z op = silu_mul_quant 2025-05-07T20:32:21.8558804Z if compiled: 2025-05-07T20:32:21.8559071Z op = torch.compile(op) 2025-05-07T20:32:21.8559377Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.8559668Z 2025-05-07T20:32:21.8559877Z y_fp8, y_scale = fn() 2025-05-07T20:32:21.8560170Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:21.8560480Z 2025-05-07T20:32:21.8560744Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.8561093Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:21.8561412Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:21.8561782Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:21.8562157Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:21.8562472Z 2025-05-07T20:32:21.8562690Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:21.8562897Z 2025-05-07T20:32:21.8563009Z moe/activation_test.py:126: 2025-05-07T20:32:21.8563316Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.8563665Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:21.8564005Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:21.8564804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:21.8565643Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:21.8566204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.8566894Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.8567726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:21.8568498Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:21.8569263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:21.8570022Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:21.8570765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:21.8571491Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:21.8572131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:21.8572705Z fn() 2025-05-07T20:32:21.8573395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:21.8573988Z self.fn.run( 2025-05-07T20:32:21.8574472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.8575088Z kernel = self.compile( 2025-05-07T20:32:21.8575634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.8576298Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.8576707Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.8576941Z 2025-05-07T20:32:21.8577165Z self = 2025-05-07T20:32:21.8578246Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.8579639Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f0508e45580>} 2025-05-07T20:32:21.8581037Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.8582069Z context = 2025-05-07T20:32:21.8582359Z 2025-05-07T20:32:21.8582540Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.8583069Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.8583548Z module_map=module_map) 2025-05-07T20:32:21.8583928Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.8584292Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:21.8584575Z E ^ 2025-05-07T20:32:21.8585048Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.8585506Z 2025-05-07T20:32:21.8585932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.8586444Z 2025-05-07T20:32:21.8586555Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.8586977Z self=, 2025-05-07T20:32:21.8587390Z T=1, 2025-05-07T20:32:21.8587585Z D=5120, 2025-05-07T20:32:21.8587850Z scale_ub=1200.0, 2025-05-07T20:32:21.8588092Z contiguous=False, 2025-05-07T20:32:21.8588325Z compiled=True, 2025-05-07T20:32:21.8588548Z ) 2025-05-07T20:32:21.9789365Z self = 2025-05-07T20:32:21.9790133Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:21.9790409Z 2025-05-07T20:32:21.9790493Z @given( 2025-05-07T20:32:21.9790758Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.9791088Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.9791398Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.9791740Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.9792080Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.9792377Z ) 2025-05-07T20:32:21.9792730Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.9793187Z def test_silu_mul_quant( 2025-05-07T20:32:21.9793693Z self, 2025-05-07T20:32:21.9793897Z T: int, 2025-05-07T20:32:21.9794111Z D: int, 2025-05-07T20:32:21.9794340Z scale_ub: Optional[float], 2025-05-07T20:32:21.9794614Z contiguous: bool, 2025-05-07T20:32:21.9794865Z compiled: bool, 2025-05-07T20:32:21.9795104Z ) -> None: 2025-05-07T20:32:21.9795324Z torch.manual_seed(2025) 2025-05-07T20:32:21.9795578Z 2025-05-07T20:32:21.9795954Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.9796302Z 2025-05-07T20:32:21.9796505Z x_sign = torch.sign(x) 2025-05-07T20:32:21.9796813Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.9797125Z x = x_sign * x_clamp 2025-05-07T20:32:21.9797375Z x0 = x[:, :D] 2025-05-07T20:32:21.9797600Z x1 = x[:, D:] 2025-05-07T20:32:21.9797812Z 2025-05-07T20:32:21.9798012Z if contiguous: 2025-05-07T20:32:21.9798259Z x0 = x0.contiguous() 2025-05-07T20:32:21.9798532Z x1 = x1.contiguous() 2025-05-07T20:32:21.9798775Z 2025-05-07T20:32:21.9798978Z if scale_ub is not None: 2025-05-07T20:32:21.9799260Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.9799598Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.9799914Z ) 2025-05-07T20:32:21.9800117Z else: 2025-05-07T20:32:21.9800335Z scale_ub_tensor = None 2025-05-07T20:32:21.9800687Z 2025-05-07T20:32:21.9800926Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.9801242Z op = silu_mul_quant 2025-05-07T20:32:21.9801504Z if compiled: 
2025-05-07T20:32:21.9801759Z op = torch.compile(op)
2025-05-07T20:32:21.9802054Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:21.9802339Z 
2025-05-07T20:32:21.9802541Z > y_fp8, y_scale = fn()
2025-05-07T20:32:21.9802710Z 
2025-05-07T20:32:21.9802829Z moe/activation_test.py:117: 
2025-05-07T20:32:21.9803130Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:21.9803476Z moe/activation_test.py:115: in fn
2025-05-07T20:32:21.9803773Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:21.9804343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:21.9804919Z return fn(*args, **kwargs)
2025-05-07T20:32:21.9805898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:21.9806610Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:21.9807151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:21.9807898Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:21.9808656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:21.9809199Z kernel = self.compile(
2025-05-07T20:32:21.9809743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:21.9810413Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:21.9818126Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:21.9818485Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:21.9818751Z E ^
2025-05-07T20:32:21.9819234Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:21.9819686Z 
2025-05-07T20:32:21.9820106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:21.9820626Z 
[The test source and the same CompilationError traceback repeat for each of the following examples:]
2025-05-07T20:32:21.9820734Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:21.9852465Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:22.2244741Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:22.3168645Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:22.3200885Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:22.3232608Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:22.4624366Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:22.5793421Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:22.5796979Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:22.5817660Z > y_fp8, y_scale = fn()
2025-05-07T20:32:22.5817830Z 
2025-05-07T20:32:22.5817949Z moe/activation_test.py:117: 
2025-05-07T20:32:22.5818257Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:22.5818610Z moe/activation_test.py:115: in fn
2025-05-07T20:32:22.5818906Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:22.5819621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:22.5820324Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:22.5820879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:22.5821579Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:22.5822261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:22.5822804Z kernel = self.compile(
2025-05-07T20:32:22.5823361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:22.5824031Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:22.5831737Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.5832102Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:22.5832374Z E ^
2025-05-07T20:32:22.5832843Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.5833347Z 2025-05-07T20:32:22.5833767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.5834290Z 2025-05-07T20:32:22.5834398Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.5834824Z self=, 2025-05-07T20:32:22.5835230Z T=4096, 2025-05-07T20:32:22.5835442Z D=5120, 2025-05-07T20:32:22.5835654Z scale_ub=1200.0, 2025-05-07T20:32:22.5835885Z contiguous=False, 2025-05-07T20:32:22.5836126Z compiled=True, 2025-05-07T20:32:22.5836345Z ) 2025-05-07T20:32:22.5836671Z self = 2025-05-07T20:32:22.5837174Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:22.5837451Z 2025-05-07T20:32:22.5837541Z @given( 2025-05-07T20:32:22.5837829Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.5838156Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.5838475Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.5838817Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.5839149Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.5839444Z ) 2025-05-07T20:32:22.5839803Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.5840256Z def test_silu_mul_quant( 2025-05-07T20:32:22.5840514Z self, 2025-05-07T20:32:22.5840730Z T: int, 2025-05-07T20:32:22.5840938Z D: int, 2025-05-07T20:32:22.5841174Z scale_ub: Optional[float], 2025-05-07T20:32:22.5841460Z contiguous: bool, 2025-05-07T20:32:22.5841711Z compiled: bool, 2025-05-07T20:32:22.5841974Z ) -> None: 2025-05-07T20:32:22.5842233Z torch.manual_seed(2025) 2025-05-07T20:32:22.5842489Z 2025-05-07T20:32:22.5842776Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.5843139Z 2025-05-07T20:32:22.5843350Z x_sign = torch.sign(x) 2025-05-07T20:32:22.5843648Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.5843971Z x = x_sign * x_clamp 2025-05-07T20:32:22.5844232Z x0 = x[:, :D] 2025-05-07T20:32:22.5844457Z x1 = x[:, D:] 2025-05-07T20:32:22.5844681Z 2025-05-07T20:32:22.5844885Z if contiguous: 2025-05-07T20:32:22.5845174Z x0 = x0.contiguous() 2025-05-07T20:32:22.5845449Z x1 = x1.contiguous() 2025-05-07T20:32:22.5845702Z 2025-05-07T20:32:22.5845902Z if scale_ub is not None: 2025-05-07T20:32:22.5846194Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.5846544Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.5846859Z ) 2025-05-07T20:32:22.5847068Z else: 2025-05-07T20:32:22.5847300Z scale_ub_tensor = None 2025-05-07T20:32:22.5847612Z 2025-05-07T20:32:22.5847850Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.5848171Z op = silu_mul_quant 2025-05-07T20:32:22.5848430Z if compiled: 2025-05-07T20:32:22.5848678Z op = torch.compile(op) 2025-05-07T20:32:22.5848981Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.5849263Z 2025-05-07T20:32:22.5849460Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.5849637Z 2025-05-07T20:32:22.5849787Z moe/activation_test.py:117: 2025-05-07T20:32:22.5850094Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.5850433Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.5850722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.5851284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.5851892Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.5852553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.5853245Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.5853788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.5854467Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.5855136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.5855678Z kernel = self.compile( 2025-05-07T20:32:22.5856225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.5856880Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.5857285Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.5857561Z 2025-05-07T20:32:22.5857778Z self = 2025-05-07T20:32:22.5858855Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.5860227Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05be074860>} 2025-05-07T20:32:22.5861551Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.5862620Z context = 2025-05-07T20:32:22.5862914Z 2025-05-07T20:32:22.5863081Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.5863601Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.5864056Z module_map=module_map) 2025-05-07T20:32:22.5864416Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.5864766Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.5865023Z E ^ 2025-05-07T20:32:22.5865529Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.5865979Z 2025-05-07T20:32:22.5866389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.5866894Z 2025-05-07T20:32:22.6737230Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.6737842Z self=, 2025-05-07T20:32:22.6738278Z T=2048, 2025-05-07T20:32:22.6738467Z D=7168, 2025-05-07T20:32:22.6738664Z scale_ub=1200.0, 2025-05-07T20:32:22.6738894Z contiguous=False, 2025-05-07T20:32:22.6739125Z compiled=False, 2025-05-07T20:32:22.6739334Z ) 2025-05-07T20:32:22.6739658Z self = 2025-05-07T20:32:22.6740160Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:22.6740435Z 2025-05-07T20:32:22.6740524Z @given( 2025-05-07T20:32:22.6741026Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.6741347Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.6741649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.6741983Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.6742315Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.6742600Z ) 2025-05-07T20:32:22.6742953Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.6743466Z def test_silu_mul_quant( 2025-05-07T20:32:22.6743715Z self, 2025-05-07T20:32:22.6743911Z T: int, 2025-05-07T20:32:22.6744113Z D: int, 2025-05-07T20:32:22.6744336Z scale_ub: Optional[float], 2025-05-07T20:32:22.6744605Z contiguous: bool, 2025-05-07T20:32:22.6744847Z compiled: bool, 2025-05-07T20:32:22.6745080Z ) -> None: 2025-05-07T20:32:22.6745297Z torch.manual_seed(2025) 2025-05-07T20:32:22.6745547Z 2025-05-07T20:32:22.6745831Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.6746179Z 2025-05-07T20:32:22.6746380Z x_sign = torch.sign(x) 2025-05-07T20:32:22.6746677Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.6746985Z x = x_sign * x_clamp 2025-05-07T20:32:22.6747234Z x0 = x[:, :D] 2025-05-07T20:32:22.6747457Z x1 = x[:, D:] 2025-05-07T20:32:22.6747788Z 2025-05-07T20:32:22.6747983Z if contiguous: 2025-05-07T20:32:22.6748220Z x0 = x0.contiguous() 2025-05-07T20:32:22.6748484Z x1 = x1.contiguous() 2025-05-07T20:32:22.6748723Z 2025-05-07T20:32:22.6748921Z if scale_ub is not None: 2025-05-07T20:32:22.6749201Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.6749536Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.6749848Z ) 2025-05-07T20:32:22.6750045Z else: 2025-05-07T20:32:22.6750260Z scale_ub_tensor = None 2025-05-07T20:32:22.6750520Z 2025-05-07T20:32:22.6750756Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.6751068Z op = silu_mul_quant 2025-05-07T20:32:22.6751323Z if compiled: 2025-05-07T20:32:22.6751574Z op = torch.compile(op) 2025-05-07T20:32:22.6751867Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.6752143Z 2025-05-07T20:32:22.6752345Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.6752507Z 2025-05-07T20:32:22.6752611Z moe/activation_test.py:117: 2025-05-07T20:32:22.6752903Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.6753239Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.6753519Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.6754277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:22.6754978Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.6755518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.6756200Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.6756854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.6757392Z kernel = self.compile( 2025-05-07T20:32:22.6757932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.6758576Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.6758970Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.6759204Z 2025-05-07T20:32:22.6759408Z self = 2025-05-07T20:32:22.6760524Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.6761901Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05be0756c0>} 2025-05-07T20:32:22.6763287Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.6764302Z context = 2025-05-07T20:32:22.6764590Z 2025-05-07T20:32:22.6764764Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.6765287Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.6765757Z module_map=module_map) 2025-05-07T20:32:22.6766133Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.6766502Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.6766756Z E ^ 2025-05-07T20:32:22.6767228Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.6767756Z 2025-05-07T20:32:22.6768223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.6768734Z 2025-05-07T20:32:22.6768848Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.6769253Z self=, 2025-05-07T20:32:22.6769661Z T=1, 2025-05-07T20:32:22.6769852Z D=7168, 2025-05-07T20:32:22.6770049Z scale_ub=None, 2025-05-07T20:32:22.6770268Z contiguous=True, 2025-05-07T20:32:22.6770503Z compiled=False, 2025-05-07T20:32:22.6770716Z ) 2025-05-07T20:32:22.6771041Z self = 2025-05-07T20:32:22.6771523Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:22.6771785Z 2025-05-07T20:32:22.6771867Z @given( 2025-05-07T20:32:22.6772100Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.6772417Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.6772728Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.6773057Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.6773384Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.6773676Z ) 2025-05-07T20:32:22.6774016Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.6774456Z def test_silu_mul_quant( 2025-05-07T20:32:22.6774708Z self, 2025-05-07T20:32:22.6774950Z T: int, 2025-05-07T20:32:22.6775161Z D: int, 2025-05-07T20:32:22.6775382Z scale_ub: Optional[float], 2025-05-07T20:32:22.6775655Z contiguous: bool, 2025-05-07T20:32:22.6775900Z compiled: bool, 2025-05-07T20:32:22.6776131Z ) -> None: 2025-05-07T20:32:22.6776341Z torch.manual_seed(2025) 2025-05-07T20:32:22.6776587Z 2025-05-07T20:32:22.6776860Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.6777214Z 2025-05-07T20:32:22.6777400Z x_sign = torch.sign(x) 2025-05-07T20:32:22.6777703Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.6778011Z x = x_sign * x_clamp 2025-05-07T20:32:22.6778254Z x0 = x[:, :D] 2025-05-07T20:32:22.6778471Z x1 = x[:, D:] 2025-05-07T20:32:22.6778685Z 2025-05-07T20:32:22.6778868Z if contiguous: 2025-05-07T20:32:22.6779104Z x0 = x0.contiguous() 2025-05-07T20:32:22.6779367Z x1 = x1.contiguous() 2025-05-07T20:32:22.6779614Z 2025-05-07T20:32:22.6779882Z if scale_ub is not None: 2025-05-07T20:32:22.6780164Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.6780493Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.6780809Z ) 2025-05-07T20:32:22.6781014Z else: 2025-05-07T20:32:22.6781229Z scale_ub_tensor = None 2025-05-07T20:32:22.6781481Z 2025-05-07T20:32:22.6781720Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.6782085Z op = silu_mul_quant 2025-05-07T20:32:22.6782339Z if compiled: 2025-05-07T20:32:22.6782590Z op = torch.compile(op) 2025-05-07T20:32:22.6782900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.6783171Z 2025-05-07T20:32:22.6783372Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.6783537Z 2025-05-07T20:32:22.6783644Z moe/activation_test.py:117: 2025-05-07T20:32:22.6783947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.6784281Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.6784573Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.6785254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.6785948Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.6786485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.6787223Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.6787876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.6788413Z kernel = self.compile( 2025-05-07T20:32:22.6788954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.6789622Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.6790016Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.6790251Z 2025-05-07T20:32:22.6790458Z self = 2025-05-07T20:32:22.6791536Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.6792959Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05be074fe0>} 2025-05-07T20:32:22.6794345Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.6795358Z context = 2025-05-07T20:32:22.6795654Z 2025-05-07T20:32:22.6795822Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.6796343Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.6796817Z module_map=module_map) 2025-05-07T20:32:22.6797180Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.6797544Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.6797806Z E ^ 2025-05-07T20:32:22.6798269Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.6798721Z 2025-05-07T20:32:22.6799133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.6799649Z 2025-05-07T20:32:22.6799755Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.6800217Z self=, 2025-05-07T20:32:22.6800613Z T=16384, 2025-05-07T20:32:22.6800818Z D=7168, 2025-05-07T20:32:22.6801014Z scale_ub=1200.0, 2025-05-07T20:32:22.6801239Z contiguous=False, 2025-05-07T20:32:22.6801467Z compiled=True, 2025-05-07T20:32:23.0268104Z ) 2025-05-07T20:32:23.0268514Z self = 2025-05-07T20:32:23.0269481Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:23.0269777Z 2025-05-07T20:32:23.0269859Z @given( 2025-05-07T20:32:23.0270097Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.0270413Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.0270717Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.0271050Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.0271396Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.0271681Z ) 2025-05-07T20:32:23.0272036Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.0272482Z def test_silu_mul_quant( 2025-05-07T20:32:23.0272723Z self, 2025-05-07T20:32:23.0272924Z T: int, 2025-05-07T20:32:23.0273125Z D: int, 2025-05-07T20:32:23.0273344Z scale_ub: Optional[float], 2025-05-07T20:32:23.0273740Z contiguous: bool, 2025-05-07T20:32:23.0273984Z compiled: bool, 2025-05-07T20:32:23.0274223Z ) -> None: 2025-05-07T20:32:23.0274443Z torch.manual_seed(2025) 2025-05-07T20:32:23.0274691Z 2025-05-07T20:32:23.0274968Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.0275310Z 2025-05-07T20:32:23.0275510Z x_sign = torch.sign(x) 2025-05-07T20:32:23.0275808Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.0276118Z x = x_sign * x_clamp 2025-05-07T20:32:23.0276369Z x0 = x[:, :D] 2025-05-07T20:32:23.0276602Z x1 = x[:, D:] 2025-05-07T20:32:23.0276811Z 2025-05-07T20:32:23.0277004Z if contiguous: 2025-05-07T20:32:23.0277245Z x0 = x0.contiguous() 2025-05-07T20:32:23.0277506Z x1 = x1.contiguous() 2025-05-07T20:32:23.0277756Z 2025-05-07T20:32:23.0277956Z if scale_ub is not None: 2025-05-07T20:32:23.0278231Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.0278579Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.0278893Z ) 2025-05-07T20:32:23.0279092Z else: 2025-05-07T20:32:23.0279299Z scale_ub_tensor = None 2025-05-07T20:32:23.0279554Z 2025-05-07T20:32:23.0279793Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.0280102Z op = silu_mul_quant 2025-05-07T20:32:23.0280354Z if compiled: 2025-05-07T20:32:23.0280686Z op = torch.compile(op) 2025-05-07T20:32:23.0280982Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.0281261Z 2025-05-07T20:32:23.0281454Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.0281617Z 2025-05-07T20:32:23.0281717Z moe/activation_test.py:117: 2025-05-07T20:32:23.0282019Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.0282353Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.0282635Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.0283197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.0283767Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.0284431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.0285133Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.0295628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.0296346Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.0297034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.0297579Z kernel = self.compile( 2025-05-07T20:32:23.0298171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.0298894Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.0299312Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.0299547Z 2025-05-07T20:32:23.0299766Z self = 2025-05-07T20:32:23.0300856Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.0302258Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05be077b00>} 2025-05-07T20:32:23.0303625Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.0304717Z context = 2025-05-07T20:32:23.0305011Z 2025-05-07T20:32:23.0305191Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.0306033Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.0306519Z module_map=module_map) 2025-05-07T20:32:23.0306903Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.0307263Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.0307537Z E ^ 2025-05-07T20:32:23.0308015Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.0308472Z 2025-05-07T20:32:23.0308899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.0309422Z 2025-05-07T20:32:23.0309531Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.0309960Z self=, 2025-05-07T20:32:23.0310375Z T=1, 2025-05-07T20:32:23.0310566Z D=7168, 2025-05-07T20:32:23.0310780Z scale_ub=None, 2025-05-07T20:32:23.0311009Z contiguous=False, 2025-05-07T20:32:23.0311253Z compiled=False, 2025-05-07T20:32:23.0311465Z ) 2025-05-07T20:32:23.0311877Z self = 2025-05-07T20:32:23.0312380Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:23.0312645Z 2025-05-07T20:32:23.0312737Z @given( 2025-05-07T20:32:23.0312976Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.0313301Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.0313619Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.0313958Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.0314306Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.0314607Z ) 2025-05-07T20:32:23.0314962Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.0315416Z def test_silu_mul_quant( 2025-05-07T20:32:23.0315671Z self, 2025-05-07T20:32:23.0315881Z T: int, 2025-05-07T20:32:23.0316086Z D: int, 2025-05-07T20:32:23.0316317Z scale_ub: Optional[float], 2025-05-07T20:32:23.0316605Z contiguous: bool, 2025-05-07T20:32:23.0316919Z compiled: bool, 2025-05-07T20:32:23.0317160Z ) -> None: 2025-05-07T20:32:23.0317389Z torch.manual_seed(2025) 2025-05-07T20:32:23.0317637Z 2025-05-07T20:32:23.0317923Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.0318282Z 2025-05-07T20:32:23.0318483Z x_sign = torch.sign(x) 2025-05-07T20:32:23.0318790Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.0319195Z x = x_sign * x_clamp 2025-05-07T20:32:23.0319444Z x0 = x[:, :D] 2025-05-07T20:32:23.0319675Z x1 = x[:, D:] 2025-05-07T20:32:23.0319897Z 2025-05-07T20:32:23.0320089Z if contiguous: 2025-05-07T20:32:23.0320336Z x0 = x0.contiguous() 2025-05-07T20:32:23.0320608Z x1 = x1.contiguous() 2025-05-07T20:32:23.0320856Z 2025-05-07T20:32:23.0321062Z if scale_ub is not None: 2025-05-07T20:32:23.0321353Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.0321713Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.0322066Z ) 2025-05-07T20:32:23.0322274Z else: 2025-05-07T20:32:23.0322500Z scale_ub_tensor = None 2025-05-07T20:32:23.0322754Z 2025-05-07T20:32:23.0322998Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.0323326Z op = silu_mul_quant 2025-05-07T20:32:23.0323586Z if compiled: 2025-05-07T20:32:23.0323929Z op = torch.compile(op) 2025-05-07T20:32:23.0324240Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.0324520Z 2025-05-07T20:32:23.0324726Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.0324895Z 2025-05-07T20:32:23.0325005Z moe/activation_test.py:117: 2025-05-07T20:32:23.0325305Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.0325652Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.0325955Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.0326663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.0327362Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.0328045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.0328750Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.0329432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.0329974Z kernel = self.compile( 2025-05-07T20:32:23.0330526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.0331195Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.0331649Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.0331922Z 2025-05-07T20:32:23.0332158Z self = 2025-05-07T20:32:23.0333248Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.0334630Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0509070680>} 2025-05-07T20:32:23.0335974Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.0337000Z context = 2025-05-07T20:32:23.0337299Z 2025-05-07T20:32:23.0337512Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.0338042Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.0338518Z module_map=module_map) 2025-05-07T20:32:23.0338883Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.0339247Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.0339565Z E ^ 2025-05-07T20:32:23.0340028Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.0340485Z 2025-05-07T20:32:23.0340901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.0341418Z 2025-05-07T20:32:23.0341525Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.0341945Z self=, 2025-05-07T20:32:23.0342348Z T=2048, 2025-05-07T20:32:23.0342543Z D=7168, 2025-05-07T20:32:23.0342737Z scale_ub=None, 2025-05-07T20:32:23.0342948Z contiguous=False, 2025-05-07T20:32:23.0343175Z compiled=True, 2025-05-07T20:32:23.0343379Z ) 2025-05-07T20:32:23.1014258Z self = 2025-05-07T20:32:23.1014854Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:23.1015507Z 2025-05-07T20:32:23.1015593Z @given( 2025-05-07T20:32:23.1015839Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.1016164Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.1016475Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.1016815Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.1017153Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.1017452Z ) 2025-05-07T20:32:23.1017812Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.1018261Z def test_silu_mul_quant( 2025-05-07T20:32:23.1018513Z self, 2025-05-07T20:32:23.1018713Z T: int, 2025-05-07T20:32:23.1018922Z D: int, 2025-05-07T20:32:23.1019156Z scale_ub: Optional[float], 2025-05-07T20:32:23.1019431Z contiguous: bool, 2025-05-07T20:32:23.1019686Z compiled: bool, 2025-05-07T20:32:23.1019928Z ) -> None: 2025-05-07T20:32:23.1020159Z torch.manual_seed(2025) 2025-05-07T20:32:23.1020413Z 2025-05-07T20:32:23.1020700Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.1021056Z 2025-05-07T20:32:23.1021255Z x_sign = torch.sign(x) 2025-05-07T20:32:23.1021561Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.1021926Z x = x_sign * x_clamp 2025-05-07T20:32:23.1022180Z x0 = x[:, :D] 2025-05-07T20:32:23.1022413Z x1 = x[:, D:] 2025-05-07T20:32:23.1022725Z 2025-05-07T20:32:23.1022920Z if contiguous: 2025-05-07T20:32:23.1023162Z x0 = x0.contiguous() 2025-05-07T20:32:23.1023427Z x1 = x1.contiguous() 2025-05-07T20:32:23.1023678Z 2025-05-07T20:32:23.1023875Z if scale_ub is not None: 2025-05-07T20:32:23.1024155Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.1024497Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.1024815Z ) 2025-05-07T20:32:23.1025017Z else: 2025-05-07T20:32:23.1025239Z scale_ub_tensor = None 2025-05-07T20:32:23.1025494Z 2025-05-07T20:32:23.1025734Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.1026056Z op = silu_mul_quant 2025-05-07T20:32:23.1026312Z if compiled: 2025-05-07T20:32:23.1026570Z op = torch.compile(op) 2025-05-07T20:32:23.1026875Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.1027154Z 2025-05-07T20:32:23.1027443Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.1027620Z 2025-05-07T20:32:23.1027724Z moe/activation_test.py:117: 2025-05-07T20:32:23.1028028Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.1028364Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.1028654Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.1029221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.1029857Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.1030524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.1031220Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.1031764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.1032452Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.1033120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.1033661Z kernel = self.compile( 2025-05-07T20:32:23.1034208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.1034873Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.1035360Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.1035591Z 2025-05-07T20:32:23.1035807Z self = 2025-05-07T20:32:23.1036883Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.1038274Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0509071d00>} 2025-05-07T20:32:23.1039618Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.1040645Z context = 2025-05-07T20:32:23.1040938Z 2025-05-07T20:32:23.1041114Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.1041638Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.1042114Z module_map=module_map) 2025-05-07T20:32:23.1042491Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.1042852Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.1043167Z E ^ 2025-05-07T20:32:23.1043647Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.1044103Z 2025-05-07T20:32:23.1044528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.1045040Z 2025-05-07T20:32:23.1045147Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.1045578Z self=, 2025-05-07T20:32:23.1045993Z T=4096, 2025-05-07T20:32:23.1046187Z D=7168, 2025-05-07T20:32:23.1046389Z scale_ub=None, 2025-05-07T20:32:23.1046615Z contiguous=False, 2025-05-07T20:32:23.1046844Z compiled=True, 2025-05-07T20:32:23.1047063Z ) 2025-05-07T20:32:23.1047387Z self = 2025-05-07T20:32:23.1048011Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:23.1048283Z 2025-05-07T20:32:23.1048417Z @given( 2025-05-07T20:32:23.1048660Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.1048988Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.1049296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.1049634Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.1049972Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.1050312Z ) 2025-05-07T20:32:23.1050669Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.1051117Z def test_silu_mul_quant( 2025-05-07T20:32:23.1051365Z self, 2025-05-07T20:32:23.1051567Z T: int, 2025-05-07T20:32:23.1051773Z D: int, 2025-05-07T20:32:23.1052000Z scale_ub: Optional[float], 2025-05-07T20:32:23.1052275Z contiguous: bool, 2025-05-07T20:32:23.1052525Z compiled: bool, 2025-05-07T20:32:23.1052759Z ) -> None: 2025-05-07T20:32:23.1052989Z torch.manual_seed(2025) 2025-05-07T20:32:23.1053243Z 2025-05-07T20:32:23.1053525Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.1053870Z 2025-05-07T20:32:23.1054072Z x_sign = torch.sign(x) 2025-05-07T20:32:23.1054373Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.1054686Z x = x_sign * x_clamp 2025-05-07T20:32:23.1054941Z x0 = x[:, :D] 2025-05-07T20:32:23.1055223Z x1 = x[:, D:] 2025-05-07T20:32:23.1055438Z 2025-05-07T20:32:23.1055637Z if contiguous: 2025-05-07T20:32:23.1055880Z x0 = x0.contiguous() 2025-05-07T20:32:23.1056141Z x1 = x1.contiguous() 2025-05-07T20:32:23.1056391Z 2025-05-07T20:32:23.1056597Z if scale_ub is not None: 2025-05-07T20:32:23.1056874Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.1057219Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.1057541Z ) 2025-05-07T20:32:23.1057750Z else: 2025-05-07T20:32:23.1057969Z scale_ub_tensor = None 2025-05-07T20:32:23.1058230Z 2025-05-07T20:32:23.1058475Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.1058793Z op = silu_mul_quant 2025-05-07T20:32:23.1059055Z if compiled: 2025-05-07T20:32:23.1059310Z op = torch.compile(op) 2025-05-07T20:32:23.1059608Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.1059899Z 2025-05-07T20:32:23.1060108Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.1060275Z 2025-05-07T20:32:23.1060378Z moe/activation_test.py:117: 2025-05-07T20:32:23.1060682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.1061022Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.1061315Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.1061925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.1062496Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.1063153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.1063847Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.1064390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.1065085Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.1065758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.1066292Z kernel = self.compile( 2025-05-07T20:32:23.1066842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.1067507Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.1067948Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.1068187Z 2025-05-07T20:32:23.1068394Z self = 2025-05-07T20:32:23.1069479Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.1070892Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0509072840>} 2025-05-07T20:32:23.1072226Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.1073246Z context = 2025-05-07T20:32:23.1073537Z 2025-05-07T20:32:23.1073701Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.1074222Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.1074688Z module_map=module_map) 2025-05-07T20:32:23.1075048Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.1075451Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.1075719Z E ^ 2025-05-07T20:32:23.1076181Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.1076634Z 2025-05-07T20:32:23.1077047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.1077560Z 2025-05-07T20:32:23.2334354Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.2334989Z self=, 2025-05-07T20:32:23.2335564Z T=16384, 2025-05-07T20:32:23.2335816Z D=5120, 2025-05-07T20:32:23.2336086Z scale_ub=1200.0, 2025-05-07T20:32:23.2336325Z contiguous=False, 2025-05-07T20:32:23.2336550Z compiled=False, 2025-05-07T20:32:23.2336767Z ) 2025-05-07T20:32:23.2337091Z self = 2025-05-07T20:32:23.2337607Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:23.2337898Z 2025-05-07T20:32:23.2337981Z @given( 2025-05-07T20:32:23.2338220Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.2338539Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.2338848Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.2339184Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.2339516Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.2340072Z ) 2025-05-07T20:32:23.2340430Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.2340874Z def test_silu_mul_quant( 2025-05-07T20:32:23.2341122Z self, 2025-05-07T20:32:23.2341314Z T: int, 2025-05-07T20:32:23.2341524Z D: int, 2025-05-07T20:32:23.2341752Z scale_ub: Optional[float], 2025-05-07T20:32:23.2342081Z contiguous: bool, 2025-05-07T20:32:23.2342325Z compiled: bool, 2025-05-07T20:32:23.2342559Z ) -> None: 2025-05-07T20:32:23.2342785Z torch.manual_seed(2025) 2025-05-07T20:32:23.2343026Z 2025-05-07T20:32:23.2343302Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.2343652Z 2025-05-07T20:32:23.2343848Z x_sign = torch.sign(x) 2025-05-07T20:32:23.2344147Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.2344464Z x = x_sign * x_clamp 2025-05-07T20:32:23.2344704Z x0 = x[:, :D] 2025-05-07T20:32:23.2345006Z x1 = x[:, D:] 2025-05-07T20:32:23.2345228Z 2025-05-07T20:32:23.2345417Z if contiguous: 2025-05-07T20:32:23.2345654Z x0 = x0.contiguous() 2025-05-07T20:32:23.2345920Z x1 = x1.contiguous() 2025-05-07T20:32:23.2346166Z 2025-05-07T20:32:23.2346365Z if scale_ub is not None: 2025-05-07T20:32:23.2346642Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.2347047Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.2347364Z ) 2025-05-07T20:32:23.2347567Z else: 2025-05-07T20:32:23.2347786Z scale_ub_tensor = None 2025-05-07T20:32:23.2348041Z 2025-05-07T20:32:23.2348282Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.2348601Z op = silu_mul_quant 2025-05-07T20:32:23.2348853Z if compiled: 2025-05-07T20:32:23.2349110Z op = torch.compile(op) 2025-05-07T20:32:23.2349415Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.2349697Z 2025-05-07T20:32:23.2349905Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.2350069Z 2025-05-07T20:32:23.2350176Z moe/activation_test.py:117: 2025-05-07T20:32:23.2350472Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.2350814Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.2351104Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.2351893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:23.2352584Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.2353128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.2353813Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.2354475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.2355011Z kernel = self.compile( 2025-05-07T20:32:23.2355554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.2356212Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.2356607Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.2356846Z 2025-05-07T20:32:23.2357053Z self = 2025-05-07T20:32:23.2358132Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.2359567Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0508758040>} 2025-05-07T20:32:23.2360900Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.2361927Z context = 2025-05-07T20:32:23.2362223Z 2025-05-07T20:32:23.2362390Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.2362918Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.2363384Z module_map=module_map) 2025-05-07T20:32:23.2363755Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.2364116Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.2364384Z E ^ 2025-05-07T20:32:23.2364895Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.2365348Z 2025-05-07T20:32:23.2365765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.2366277Z 2025-05-07T20:32:23.2366388Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.2366805Z self=, 2025-05-07T20:32:23.2367210Z T=16384, 2025-05-07T20:32:23.2367487Z D=5120, 2025-05-07T20:32:23.2367804Z scale_ub=1200.0, 2025-05-07T20:32:23.2368025Z contiguous=True, 2025-05-07T20:32:23.2368252Z compiled=True, 2025-05-07T20:32:23.2368466Z ) 2025-05-07T20:32:23.2368780Z self = 2025-05-07T20:32:23.2369275Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:23.2369550Z 2025-05-07T20:32:23.2369640Z @given( 2025-05-07T20:32:23.2369876Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.2370200Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.2370511Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.2370847Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.2371175Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.2371471Z ) 2025-05-07T20:32:23.2371870Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.2372401Z def test_silu_mul_quant( 2025-05-07T20:32:23.2372644Z self, 2025-05-07T20:32:23.2372849Z T: int, 2025-05-07T20:32:23.2373055Z D: int, 2025-05-07T20:32:23.2373274Z scale_ub: Optional[float], 2025-05-07T20:32:23.2373555Z contiguous: bool, 2025-05-07T20:32:23.2373802Z compiled: bool, 2025-05-07T20:32:23.2374030Z ) -> None: 2025-05-07T20:32:23.2374256Z torch.manual_seed(2025) 2025-05-07T20:32:23.2374508Z 2025-05-07T20:32:23.2374784Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.2375143Z 2025-05-07T20:32:23.2382912Z x_sign = torch.sign(x) 2025-05-07T20:32:23.2383235Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.2383559Z x = x_sign * x_clamp 2025-05-07T20:32:23.2383815Z x0 = x[:, :D] 2025-05-07T20:32:23.2384050Z x1 = x[:, D:] 2025-05-07T20:32:23.2384266Z 2025-05-07T20:32:23.2384476Z if contiguous: 2025-05-07T20:32:23.2384729Z x0 = x0.contiguous() 2025-05-07T20:32:23.2384993Z x1 = x1.contiguous() 2025-05-07T20:32:23.2385244Z 2025-05-07T20:32:23.2385452Z if scale_ub is not None: 2025-05-07T20:32:23.2385730Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.2386081Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.2386401Z ) 2025-05-07T20:32:23.2386608Z else: 2025-05-07T20:32:23.2386906Z scale_ub_tensor = None 2025-05-07T20:32:23.2387173Z 2025-05-07T20:32:23.2387420Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.2387748Z op = silu_mul_quant 2025-05-07T20:32:23.2388012Z if compiled: 2025-05-07T20:32:23.2388276Z op = torch.compile(op) 2025-05-07T20:32:23.2388578Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.2388867Z 2025-05-07T20:32:23.2389074Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.2389248Z 2025-05-07T20:32:23.2389355Z moe/activation_test.py:117: 2025-05-07T20:32:23.2389667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.2390014Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.2390297Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.2390871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.2391446Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.2392221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.2392918Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.2393466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.2394160Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.2394881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.2395417Z kernel = self.compile( 2025-05-07T20:32:23.2395970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.2396637Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.2397039Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.2397283Z 2025-05-07T20:32:23.2397497Z self = 2025-05-07T20:32:23.2398585Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.2399966Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0508759300>} 2025-05-07T20:32:23.2401364Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.2402443Z context = 2025-05-07T20:32:23.2402741Z 2025-05-07T20:32:23.2402916Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.2403450Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.2403930Z module_map=module_map) 2025-05-07T20:32:23.2404301Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.2404672Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.2404947Z E ^ 2025-05-07T20:32:23.2405422Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.2406184Z 2025-05-07T20:32:23.2406606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.2407128Z 2025-05-07T20:32:23.5437871Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.5439071Z self=, 2025-05-07T20:32:23.5440235Z T=16384, 2025-05-07T20:32:23.5441050Z D=5120, 2025-05-07T20:32:23.5441455Z scale_ub=None, 2025-05-07T20:32:23.5441864Z contiguous=False, 2025-05-07T20:32:23.5442096Z compiled=True, 2025-05-07T20:32:23.5442316Z ) 2025-05-07T20:32:23.5442646Z self = 2025-05-07T20:32:23.5443151Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:23.5443443Z 2025-05-07T20:32:23.5443539Z @given( 2025-05-07T20:32:23.5443782Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.5444106Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.5444417Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.5444755Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.5445091Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.5445378Z ) 2025-05-07T20:32:23.5445738Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.5446264Z def test_silu_mul_quant( 2025-05-07T20:32:23.5446513Z self, 2025-05-07T20:32:23.5446720Z T: int, 2025-05-07T20:32:23.5446926Z D: int, 2025-05-07T20:32:23.5447143Z scale_ub: Optional[float], 2025-05-07T20:32:23.5447423Z contiguous: bool, 2025-05-07T20:32:23.5447776Z compiled: bool, 2025-05-07T20:32:23.5448006Z ) -> None: 2025-05-07T20:32:23.5448232Z torch.manual_seed(2025) 2025-05-07T20:32:23.5448571Z 2025-05-07T20:32:23.5448848Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.5449200Z 2025-05-07T20:32:23.5449401Z x_sign = torch.sign(x) 2025-05-07T20:32:23.5449700Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.5450012Z x = x_sign * x_clamp 2025-05-07T20:32:23.5450260Z x0 = x[:, :D] 2025-05-07T20:32:23.5450481Z x1 = x[:, D:] 2025-05-07T20:32:23.5450687Z 2025-05-07T20:32:23.5450885Z if contiguous: 2025-05-07T20:32:23.5451129Z x0 = x0.contiguous() 2025-05-07T20:32:23.5451390Z x1 = x1.contiguous() 2025-05-07T20:32:23.5451637Z 2025-05-07T20:32:23.5451834Z if scale_ub is not None: 2025-05-07T20:32:23.5452110Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.5452461Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.5452790Z ) 2025-05-07T20:32:23.5452995Z else: 2025-05-07T20:32:23.5453309Z scale_ub_tensor = None 2025-05-07T20:32:23.5453572Z 2025-05-07T20:32:23.5453805Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.5454130Z op = silu_mul_quant 2025-05-07T20:32:23.5454391Z if compiled: 2025-05-07T20:32:23.5454654Z op = torch.compile(op) 2025-05-07T20:32:23.5454954Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.5455238Z 2025-05-07T20:32:23.5455440Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.5455608Z 2025-05-07T20:32:23.5455714Z moe/activation_test.py:117: 2025-05-07T20:32:23.5456017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.5456360Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.5456647Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.5457215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.5457794Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.5458464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.5459157Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.5459703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.5460393Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.5461114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.5461657Z kernel = self.compile( 2025-05-07T20:32:23.5462201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.5462865Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.5463273Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.5463508Z 2025-05-07T20:32:23.5463723Z self = 2025-05-07T20:32:23.5464800Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.5466244Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0508759e40>} 2025-05-07T20:32:23.5467592Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.5468620Z context = 2025-05-07T20:32:23.5468952Z 2025-05-07T20:32:23.5469119Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.5469641Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.5470110Z module_map=module_map) 2025-05-07T20:32:23.5470477Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.5470828Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.5471092Z E ^ 2025-05-07T20:32:23.5471564Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:23.5472488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
The next ten Hypothesis examples all failed with this identical CompilationError inside the Triton compiler; their repeated test source and tracebacks are elided here, keeping only each example's parameters:
2025-05-07T20:32:23.5473103Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:23.6222530Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:23.7582359Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:23.7624188Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:24.0172424Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:24.1160257Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:24.1192161Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:24.2563053Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:24.2604909Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:24.3360819Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
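This failure pattern looks like an architecture mismatch rather than test flakiness: the job ran on a linux.g5.4xlarge.nvidia.gpu runner, whose NVIDIA A10G GPU is compute capability 8.6, and Triton generally exposes the fp8e4nv (e4m3) dtype only on newer GPUs (SM 8.9 Ada and SM 9.0 Hopper), which is why the error lists only 'fp8e4b15' and 'fp8e5' as supported here. A minimal guard sketch follows; the helper name, the test-class name, and the (8, 9) threshold are illustrative assumptions, not part of the logged test.

```python
# A minimal sketch, assuming standard PyTorch CUDA APIs; supports_fp8e4nv
# and SiluMulQuantTest are hypothetical names for illustration only.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv (e4m3) Triton codegen is generally limited to SM 8.9+ (Ada)
    # and SM 9.0 (Hopper); the A10G in this log is SM 8.6.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Usage sketch: skip the whole test case on unsupported GPUs instead of
# letting every Hypothesis example fail inside the Triton compiler.
@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9 or newer")
class SiluMulQuantTest(unittest.TestCase):
    ...
```

Skipping at the class level would keep Hypothesis from re-deriving the same compiler failure for every generated example, as seen above.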
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.3392188Z 2025-05-07T20:32:24.3392604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.3393119Z 2025-05-07T20:32:24.4172707Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.4173344Z self=, 2025-05-07T20:32:24.4173972Z T=16384, 2025-05-07T20:32:24.4174180Z D=5120, 2025-05-07T20:32:24.4174372Z scale_ub=None, 2025-05-07T20:32:24.4174588Z contiguous=False, 2025-05-07T20:32:24.4174816Z compiled=False, 2025-05-07T20:32:24.4175025Z ) 2025-05-07T20:32:24.4175356Z self = 2025-05-07T20:32:24.4175862Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:24.4176145Z 2025-05-07T20:32:24.4176229Z @given( 2025-05-07T20:32:24.4176588Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.4176912Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.4177213Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.4177552Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.4177888Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.4178182Z ) 2025-05-07T20:32:24.4178523Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.4178964Z def test_silu_mul_quant( 2025-05-07T20:32:24.4179213Z self, 2025-05-07T20:32:24.4179409Z T: int, 2025-05-07T20:32:24.4179614Z D: int, 2025-05-07T20:32:24.4179841Z scale_ub: Optional[float], 2025-05-07T20:32:24.4180112Z contiguous: bool, 2025-05-07T20:32:24.4180366Z compiled: bool, 2025-05-07T20:32:24.4180598Z ) -> None: 2025-05-07T20:32:24.4180816Z torch.manual_seed(2025) 2025-05-07T20:32:24.4181063Z 2025-05-07T20:32:24.4181419Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.4181764Z 2025-05-07T20:32:24.4181989Z x_sign = torch.sign(x) 2025-05-07T20:32:24.4182285Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.4184317Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
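The CompilationError above comes from Triton lowering _fbgemm_silu_mul_quant: the kernel asks for the fp8e4nv (e4m3) dtype, which Triton only implements on GPUs with compute capability 8.9 or newer, while this device only exposes fp8e4b15 and fp8e5. A minimal guard, assuming a pytest-style runner and that skipping (rather than falling back to another fp8 flavor) is acceptable here:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (e4m3) lowering requires sm_89+ (Ada/Hopper parts).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical marker for tests that exercise the fp8e4nv path:
    requires_fp8e4nv = pytest.mark.skipif(
        not supports_fp8e4nv(), reason="fp8e4nv requires compute capability >= 8.9"
    )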
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.4186268Z 2025-05-07T20:32:24.4186391Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:24.4186612Z 2025-05-07T20:32:24.4186725Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.4187148Z self=, 2025-05-07T20:32:24.4187549Z T=4096, 2025-05-07T20:32:24.4187750Z D=7168, 2025-05-07T20:32:24.4187951Z scale_ub=1200.0, 2025-05-07T20:32:24.4188180Z contiguous=True, 2025-05-07T20:32:24.4188409Z compiled=True, 2025-05-07T20:32:24.4188617Z ) 2025-05-07T20:32:24.4188937Z self = 2025-05-07T20:32:24.4189511Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:24.4189781Z 2025-05-07T20:32:24.4189871Z @given( 2025-05-07T20:32:24.4190110Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.4190419Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.4190732Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.4191069Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.4191399Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.4191690Z ) 2025-05-07T20:32:24.4192049Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.4192545Z def test_silu_mul_quant( 2025-05-07T20:32:24.4192787Z self, 2025-05-07T20:32:24.4192989Z T: int, 2025-05-07T20:32:24.4193194Z D: int, 2025-05-07T20:32:24.4193410Z scale_ub: Optional[float], 2025-05-07T20:32:24.4193692Z contiguous: bool, 2025-05-07T20:32:24.4193938Z compiled: bool, 2025-05-07T20:32:24.4194160Z ) -> None: 2025-05-07T20:32:24.4194385Z torch.manual_seed(2025) 2025-05-07T20:32:24.4194631Z 2025-05-07T20:32:24.4194903Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.4195255Z 2025-05-07T20:32:24.4195462Z x_sign = torch.sign(x) 2025-05-07T20:32:24.4195754Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.4197799Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
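The allocator hint in the OutOfMemoryError message is easy to apply too late: PYTORCH_CUDA_ALLOC_CONF is read when the caching allocator initializes, so it has to be set before the process makes its first CUDA allocation. A sketch of the two usual places to set it (the exact invocation depends on how this job launches pytest):

    # In the shell that starts the suite, not inside the test module:
    #   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m pytest moe/activation_test.py
    #
    # Or at the very top of a conftest.py, before torch touches the GPU:
    import os
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")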
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.4199678Z 2025-05-07T20:32:24.4199802Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:24.4200022Z 2025-05-07T20:32:24.4200128Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.4200545Z self=, 2025-05-07T20:32:24.4200950Z T=16384, 2025-05-07T20:32:24.4201150Z D=7168, 2025-05-07T20:32:24.4201351Z scale_ub=None, 2025-05-07T20:32:24.4201613Z contiguous=False, 2025-05-07T20:32:24.4201851Z compiled=False, 2025-05-07T20:32:24.4202070Z ) 2025-05-07T20:32:24.4202390Z self = 2025-05-07T20:32:24.4202890Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:24.4203174Z 2025-05-07T20:32:24.4203256Z @given( 2025-05-07T20:32:24.4203495Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.4203854Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.4204167Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.4204499Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.4204827Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.4205118Z ) 2025-05-07T20:32:24.4205473Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.4206216Z def test_silu_mul_quant( 2025-05-07T20:32:24.4206487Z self, 2025-05-07T20:32:24.4206699Z T: int, 2025-05-07T20:32:24.4206906Z D: int, 2025-05-07T20:32:24.4207130Z scale_ub: Optional[float], 2025-05-07T20:32:24.4207406Z contiguous: bool, 2025-05-07T20:32:24.4207730Z compiled: bool, 2025-05-07T20:32:24.4207950Z ) -> None: 2025-05-07T20:32:24.4208173Z torch.manual_seed(2025) 2025-05-07T20:32:24.4208421Z 2025-05-07T20:32:24.4208692Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.4210836Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.4212700Z 2025-05-07T20:32:24.4212821Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.4213032Z 2025-05-07T20:32:24.4213145Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.4213565Z self=, 2025-05-07T20:32:24.4213968Z T=2048, 2025-05-07T20:32:24.4214171Z D=7168, 2025-05-07T20:32:24.4214370Z scale_ub=1200.0, 2025-05-07T20:32:24.4214593Z contiguous=True, 2025-05-07T20:32:24.4214820Z compiled=True, 2025-05-07T20:32:24.4215031Z ) 2025-05-07T20:32:24.4215348Z self = 2025-05-07T20:32:24.4215841Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:24.4216111Z 2025-05-07T20:32:24.4216198Z @given( 2025-05-07T20:32:24.4216492Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.4216817Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.4217125Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.4217458Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.4217786Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.4218078Z ) 2025-05-07T20:32:24.4218431Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.4218874Z def test_silu_mul_quant( 2025-05-07T20:32:24.4219121Z self, 2025-05-07T20:32:24.4219324Z T: int, 2025-05-07T20:32:24.4219521Z D: int, 2025-05-07T20:32:24.4219747Z scale_ub: Optional[float], 2025-05-07T20:32:24.4220026Z contiguous: bool, 2025-05-07T20:32:24.4220267Z compiled: bool, 2025-05-07T20:32:24.4220500Z ) -> None: 2025-05-07T20:32:24.4220727Z torch.manual_seed(2025) 2025-05-07T20:32:24.4220968Z 2025-05-07T20:32:24.4221313Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.4221668Z 2025-05-07T20:32:24.4221863Z x_sign = torch.sign(x) 2025-05-07T20:32:24.4222162Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.4224152Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
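The "Tried to allocate" figures line up exactly with the bf16 input the test builds: x has shape [T, 2*D] at 2 bytes per element, so T=16384, D=7168 gives 16384 × 14336 × 2 B = 448 MiB, and T=2048, D=7168 gives 2048 × 14336 × 2 B = 56 MiB. The failures are ordinary allocation pressure on an almost-full device, not a miscomputed shape.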
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.4226059Z 2025-05-07T20:32:24.4226178Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:24.4226393Z 2025-05-07T20:32:24.4226504Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.4226936Z self=, 2025-05-07T20:32:24.4227334Z T=2048, 2025-05-07T20:32:24.4227533Z D=7168, 2025-05-07T20:32:24.4227739Z scale_ub=None, 2025-05-07T20:32:24.4227955Z contiguous=True, 2025-05-07T20:32:24.4228189Z compiled=False, 2025-05-07T20:32:24.4235919Z ) 2025-05-07T20:32:24.5104898Z self = 2025-05-07T20:32:24.5105575Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:24.5106137Z 2025-05-07T20:32:24.5106248Z @given( 2025-05-07T20:32:24.5106571Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.5106968Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.5107280Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.5107612Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.5107954Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.5108277Z ) 2025-05-07T20:32:24.5108635Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.5109072Z def test_silu_mul_quant( 2025-05-07T20:32:24.5109322Z self, 2025-05-07T20:32:24.5109524Z T: int, 2025-05-07T20:32:24.5109722Z D: int, 2025-05-07T20:32:24.5109952Z scale_ub: Optional[float], 2025-05-07T20:32:24.5110236Z contiguous: bool, 2025-05-07T20:32:24.5110488Z compiled: bool, 2025-05-07T20:32:24.5110716Z ) -> None: 2025-05-07T20:32:24.5110940Z torch.manual_seed(2025) 2025-05-07T20:32:24.5111191Z 2025-05-07T20:32:24.5111464Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.5111819Z 2025-05-07T20:32:24.5112022Z > x_sign = torch.sign(x) 2025-05-07T20:32:24.5114074Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
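Note the pattern across examples: the memory reported as allocated by PyTorch stays pinned near the 22.07 GiB capacity (between 21.5 and 21.73 GiB) even though each example only creates fresh local tensors, which suggests blocks from earlier examples are still being held between Hypothesis examples. A hedged cleanup sketch; since an autouse pytest fixture runs once per test function rather than once per Hypothesis example, per-example cleanup would need the same two calls at the end of the test body instead:

    import gc
    import torch
    import pytest

    @pytest.fixture(autouse=True)
    def release_cuda_memory():
        # Hypothetical teardown: drop dead references, then return cached
        # blocks so the next test starts from a cleaner device.
        yield
        gc.collect()
        torch.cuda.empty_cache()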
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.5115957Z 2025-05-07T20:32:24.5116078Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:24.5116299Z 2025-05-07T20:32:24.5116405Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.5116828Z self=, 2025-05-07T20:32:24.5117237Z T=1, 2025-05-07T20:32:24.5117426Z D=7168, 2025-05-07T20:32:24.5117631Z scale_ub=1200.0, 2025-05-07T20:32:24.5117861Z contiguous=True, 2025-05-07T20:32:24.5118084Z compiled=False, 2025-05-07T20:32:24.5118303Z ) 2025-05-07T20:32:24.5118733Z self = 2025-05-07T20:32:24.5119226Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:24.5119499Z 2025-05-07T20:32:24.5119578Z @given( 2025-05-07T20:32:24.5119816Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.5120135Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.5120504Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.5120845Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.5121181Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.5121468Z ) 2025-05-07T20:32:24.5121828Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.5122278Z def test_silu_mul_quant( 2025-05-07T20:32:24.5122526Z self, 2025-05-07T20:32:24.5122732Z T: int, 2025-05-07T20:32:24.5122938Z D: int, 2025-05-07T20:32:24.5123163Z scale_ub: Optional[float], 2025-05-07T20:32:24.5123443Z contiguous: bool, 2025-05-07T20:32:24.5123695Z compiled: bool, 2025-05-07T20:32:24.5123919Z ) -> None: 2025-05-07T20:32:24.5124143Z torch.manual_seed(2025) 2025-05-07T20:32:24.5124396Z 2025-05-07T20:32:24.5124667Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.5125021Z 2025-05-07T20:32:24.5125225Z x_sign = torch.sign(x) 2025-05-07T20:32:24.5125587Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.5125903Z x = x_sign * x_clamp 2025-05-07T20:32:24.5126150Z x0 = x[:, :D] 2025-05-07T20:32:24.5126374Z x1 = x[:, D:] 2025-05-07T20:32:24.5126583Z 2025-05-07T20:32:24.5126775Z if contiguous: 2025-05-07T20:32:24.5127024Z x0 = x0.contiguous() 2025-05-07T20:32:24.5127282Z x1 = x1.contiguous() 2025-05-07T20:32:24.5127639Z 2025-05-07T20:32:24.5127855Z if scale_ub is not None: 2025-05-07T20:32:24.5128124Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.5128463Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.5128776Z ) 2025-05-07T20:32:24.5128969Z else: 2025-05-07T20:32:24.5129188Z scale_ub_tensor = None 2025-05-07T20:32:24.5129446Z 2025-05-07T20:32:24.5129675Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.5130005Z op = silu_mul_quant 2025-05-07T20:32:24.5130261Z if compiled: 2025-05-07T20:32:24.5130517Z op = torch.compile(op) 2025-05-07T20:32:24.5130816Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.5131101Z 2025-05-07T20:32:24.5131306Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.5131472Z 2025-05-07T20:32:24.5131576Z moe/activation_test.py:117: 2025-05-07T20:32:24.5131876Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.5132266Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.5132550Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.5133256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.5133959Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.5134507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.5135201Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.5135873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.5136412Z kernel = self.compile( 2025-05-07T20:32:24.5136956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.5137623Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.5138068Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.5138301Z 2025-05-07T20:32:24.5138515Z self = 2025-05-07T20:32:24.5139600Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.5141028Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f050831c680>} 2025-05-07T20:32:24.5142383Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.5143559Z context = 2025-05-07T20:32:24.5143851Z 2025-05-07T20:32:24.5144025Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.5144547Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.5145019Z module_map=module_map) 2025-05-07T20:32:24.5145388Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.5145801Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.5146069Z E ^ 2025-05-07T20:32:24.5146537Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.5146987Z 2025-05-07T20:32:24.5147408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.5147917Z 2025-05-07T20:32:24.5148024Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.5148445Z self=, 2025-05-07T20:32:24.5148851Z T=128, 2025-05-07T20:32:24.5149042Z D=5120, 2025-05-07T20:32:24.5149242Z scale_ub=None, 2025-05-07T20:32:24.5149453Z contiguous=True, 2025-05-07T20:32:24.5149676Z compiled=False, 2025-05-07T20:32:24.5149885Z ) 2025-05-07T20:32:24.7384230Z self = 2025-05-07T20:32:24.7384761Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:24.7385063Z 2025-05-07T20:32:24.7385173Z @given( 2025-05-07T20:32:24.7385519Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.7385849Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.7386164Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.7386507Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.7386945Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.7387369Z ) 2025-05-07T20:32:24.7387731Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.7388174Z def test_silu_mul_quant( 2025-05-07T20:32:24.7388425Z self, 2025-05-07T20:32:24.7388630Z T: int, 2025-05-07T20:32:24.7388824Z D: int, 2025-05-07T20:32:24.7389050Z scale_ub: Optional[float], 2025-05-07T20:32:24.7389325Z contiguous: bool, 2025-05-07T20:32:24.7389573Z compiled: bool, 2025-05-07T20:32:24.7389803Z ) -> None: 2025-05-07T20:32:24.7390024Z torch.manual_seed(2025) 2025-05-07T20:32:24.7390268Z 2025-05-07T20:32:24.7390542Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.7390893Z 2025-05-07T20:32:24.7391093Z x_sign = torch.sign(x) 2025-05-07T20:32:24.7391385Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.7391700Z x = x_sign * x_clamp 2025-05-07T20:32:24.7391951Z x0 = x[:, :D] 2025-05-07T20:32:24.7392236Z x1 = x[:, D:] 2025-05-07T20:32:24.7392451Z 2025-05-07T20:32:24.7392674Z if contiguous: 2025-05-07T20:32:24.7392914Z x0 = x0.contiguous() 2025-05-07T20:32:24.7393180Z x1 = x1.contiguous() 2025-05-07T20:32:24.7393420Z 2025-05-07T20:32:24.7393621Z if scale_ub is not None: 2025-05-07T20:32:24.7393899Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.7394296Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.7394615Z ) 2025-05-07T20:32:24.7394813Z else: 2025-05-07T20:32:24.7395024Z scale_ub_tensor = None 2025-05-07T20:32:24.7395281Z 2025-05-07T20:32:24.7395517Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.7395831Z op = silu_mul_quant 2025-05-07T20:32:24.7396088Z if compiled: 2025-05-07T20:32:24.7396342Z op = torch.compile(op) 2025-05-07T20:32:24.7396642Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.7396925Z 2025-05-07T20:32:24.7397122Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.7397286Z 2025-05-07T20:32:24.7397390Z moe/activation_test.py:117: 2025-05-07T20:32:24.7397683Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.7398020Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.7398307Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.7399076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.7399780Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.7400327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.7401011Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.7401687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.7402224Z kernel = self.compile( 2025-05-07T20:32:24.7402772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.7403429Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.7403830Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.7404070Z 2025-05-07T20:32:24.7404280Z self = 2025-05-07T20:32:24.7405366Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.7406996Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f050831d8a0>} 2025-05-07T20:32:24.7408397Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.7409421Z context = 2025-05-07T20:32:24.7409702Z 2025-05-07T20:32:24.7409869Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.7410402Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.7410859Z module_map=module_map) 2025-05-07T20:32:24.7411228Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.7411581Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.7411842Z E ^ 2025-05-07T20:32:24.7412429Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.7412891Z 2025-05-07T20:32:24.7413310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.7413824Z 2025-05-07T20:32:24.7413937Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.7414340Z self=, 2025-05-07T20:32:24.7414752Z T=128, 2025-05-07T20:32:24.7415002Z D=7168, 2025-05-07T20:32:24.7415198Z scale_ub=None, 2025-05-07T20:32:24.7415410Z contiguous=True, 2025-05-07T20:32:24.7415639Z compiled=False, 2025-05-07T20:32:24.7415840Z ) 2025-05-07T20:32:24.7416167Z self = 2025-05-07T20:32:24.7416654Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:24.7416924Z 2025-05-07T20:32:24.7417004Z @given( 2025-05-07T20:32:24.7417236Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.7417553Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.7417859Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.7418191Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.7418515Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.7418804Z ) 2025-05-07T20:32:24.7419143Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.7419661Z def test_silu_mul_quant( 2025-05-07T20:32:24.7419902Z self, 2025-05-07T20:32:24.7420096Z T: int, 2025-05-07T20:32:24.7420294Z D: int, 2025-05-07T20:32:24.7420517Z scale_ub: Optional[float], 2025-05-07T20:32:24.7420784Z contiguous: bool, 2025-05-07T20:32:24.7421032Z compiled: bool, 2025-05-07T20:32:24.7421253Z ) -> None: 2025-05-07T20:32:24.7421471Z torch.manual_seed(2025) 2025-05-07T20:32:24.7421710Z 2025-05-07T20:32:24.7421994Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.7422336Z 2025-05-07T20:32:24.7422530Z x_sign = torch.sign(x) 2025-05-07T20:32:24.7422819Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.7423139Z x = x_sign * x_clamp 2025-05-07T20:32:24.7423371Z x0 = x[:, :D] 2025-05-07T20:32:24.7423595Z x1 = x[:, D:] 2025-05-07T20:32:24.7423803Z 2025-05-07T20:32:24.7423991Z if contiguous: 2025-05-07T20:32:24.7424226Z x0 = x0.contiguous() 2025-05-07T20:32:24.7424497Z x1 = x1.contiguous() 2025-05-07T20:32:24.7424732Z 2025-05-07T20:32:24.7424938Z if scale_ub is not None: 2025-05-07T20:32:24.7425210Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.7425549Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.7425860Z ) 2025-05-07T20:32:24.7426062Z else: 2025-05-07T20:32:24.7426322Z scale_ub_tensor = None 2025-05-07T20:32:24.7426585Z 2025-05-07T20:32:24.7426819Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.7427145Z op = silu_mul_quant 2025-05-07T20:32:24.7427392Z if compiled: 2025-05-07T20:32:24.7427657Z op = torch.compile(op) 2025-05-07T20:32:24.7427952Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.7428233Z 2025-05-07T20:32:24.7428427Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.7428604Z 2025-05-07T20:32:24.7428710Z moe/activation_test.py:117: 2025-05-07T20:32:24.7429002Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.7429335Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.7429626Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.7430311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.7431017Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.7431602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.7432296Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.7432955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.7433498Z kernel = self.compile( 2025-05-07T20:32:24.7434090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.7434755Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.7435147Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.7435385Z 2025-05-07T20:32:24.7435593Z self = 2025-05-07T20:32:24.7436680Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.7438045Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f050831e7a0>} 2025-05-07T20:32:24.7439391Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.7440461Z context = 2025-05-07T20:32:24.7440758Z 2025-05-07T20:32:24.7440930Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.7441450Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.7441928Z module_map=module_map) 2025-05-07T20:32:24.7442304Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.7442711Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.7442966Z E ^ 2025-05-07T20:32:24.7443440Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.7443896Z 2025-05-07T20:32:24.7444311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.7444831Z 2025-05-07T20:32:24.7444946Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.7445356Z self=, 2025-05-07T20:32:24.7445765Z T=2048, 2025-05-07T20:32:24.7445956Z D=7168, 2025-05-07T20:32:24.7446155Z scale_ub=1200.0, 2025-05-07T20:32:24.7446380Z contiguous=True, 2025-05-07T20:32:24.7446610Z compiled=False, 2025-05-07T20:32:24.7446864Z ) 2025-05-07T20:32:24.8121128Z self = 2025-05-07T20:32:24.8121736Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:24.8124403Z 2025-05-07T20:32:24.8124659Z @given( 2025-05-07T20:32:24.8124901Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8125225Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8125537Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8125874Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8126236Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8126525Z ) 2025-05-07T20:32:24.8126877Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8127318Z def test_silu_mul_quant( 2025-05-07T20:32:24.8127611Z self, 2025-05-07T20:32:24.8127808Z T: int, 2025-05-07T20:32:24.8128008Z D: int, 2025-05-07T20:32:24.8128341Z scale_ub: Optional[float], 2025-05-07T20:32:24.8128619Z contiguous: bool, 2025-05-07T20:32:24.8128858Z compiled: bool, 2025-05-07T20:32:24.8129084Z ) -> None: 2025-05-07T20:32:24.8129304Z torch.manual_seed(2025) 2025-05-07T20:32:24.8129544Z 2025-05-07T20:32:24.8129820Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8131880Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.8133876Z 2025-05-07T20:32:24.8133998Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.8134209Z 2025-05-07T20:32:24.8134319Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8134732Z self=, 2025-05-07T20:32:24.8135138Z T=1, 2025-05-07T20:32:24.8135331Z D=5120, 2025-05-07T20:32:24.8135527Z scale_ub=1200.0, 2025-05-07T20:32:24.8135757Z contiguous=True, 2025-05-07T20:32:24.8136052Z compiled=False, 2025-05-07T20:32:24.8136252Z ) 2025-05-07T20:32:24.8136573Z self = 2025-05-07T20:32:24.8137059Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:24.8137324Z 2025-05-07T20:32:24.8137411Z @given( 2025-05-07T20:32:24.8137640Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8137954Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8138265Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8138590Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8138915Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8139202Z ) 2025-05-07T20:32:24.8139547Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8139994Z def test_silu_mul_quant( 2025-05-07T20:32:24.8140243Z self, 2025-05-07T20:32:24.8140449Z T: int, 2025-05-07T20:32:24.8140652Z D: int, 2025-05-07T20:32:24.8140874Z scale_ub: Optional[float], 2025-05-07T20:32:24.8141141Z contiguous: bool, 2025-05-07T20:32:24.8141387Z compiled: bool, 2025-05-07T20:32:24.8141614Z ) -> None: 2025-05-07T20:32:24.8141830Z torch.manual_seed(2025) 2025-05-07T20:32:24.8142072Z 2025-05-07T20:32:24.8142347Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8142691Z 2025-05-07T20:32:24.8142958Z x_sign = torch.sign(x) 2025-05-07T20:32:24.8143260Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.8143573Z x = x_sign * x_clamp 2025-05-07T20:32:24.8143814Z x0 = x[:, :D] 2025-05-07T20:32:24.8144039Z x1 = x[:, D:] 2025-05-07T20:32:24.8144247Z 2025-05-07T20:32:24.8144433Z if contiguous: 2025-05-07T20:32:24.8144671Z x0 = x0.contiguous() 2025-05-07T20:32:24.8144941Z x1 = x1.contiguous() 2025-05-07T20:32:24.8145188Z 2025-05-07T20:32:24.8145386Z if scale_ub is not None: 2025-05-07T20:32:24.8145659Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.8145993Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.8146305Z ) 2025-05-07T20:32:24.8146503Z else: 2025-05-07T20:32:24.8146714Z scale_ub_tensor = None 2025-05-07T20:32:24.8146969Z 2025-05-07T20:32:24.8147202Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.8147569Z op = silu_mul_quant 2025-05-07T20:32:24.8147820Z if compiled: 2025-05-07T20:32:24.8148067Z op = torch.compile(op) 2025-05-07T20:32:24.8148368Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.8148636Z 2025-05-07T20:32:24.8148835Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.8149000Z 2025-05-07T20:32:24.8149107Z moe/activation_test.py:117: 2025-05-07T20:32:24.8149400Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.8149780Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.8150064Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.8150754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.8151441Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.8151983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.8152669Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.8153326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.8153861Z kernel = self.compile( 2025-05-07T20:32:24.8154399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.8155101Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.8155491Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.8155721Z 2025-05-07T20:32:24.8155924Z self = 2025-05-07T20:32:24.8157003Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.8158368Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f050831fb00>} 2025-05-07T20:32:24.8159708Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.8160729Z context = 2025-05-07T20:32:24.8161018Z 2025-05-07T20:32:24.8161182Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.8161703Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.8162173Z module_map=module_map) 2025-05-07T20:32:24.8162604Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.8169966Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.8170259Z E ^ 2025-05-07T20:32:24.8170740Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.8171190Z 2025-05-07T20:32:24.8171612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.8172133Z 2025-05-07T20:32:24.8172245Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8172719Z self=, 2025-05-07T20:32:24.8173127Z T=2048, 2025-05-07T20:32:24.8173321Z D=5120, 2025-05-07T20:32:24.8173521Z scale_ub=None, 2025-05-07T20:32:24.8173738Z contiguous=True, 2025-05-07T20:32:24.8173958Z compiled=False, 2025-05-07T20:32:24.8174162Z ) 2025-05-07T20:32:24.8174488Z self = 2025-05-07T20:32:24.8175050Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:24.8175327Z 2025-05-07T20:32:24.8175408Z @given( 2025-05-07T20:32:24.8175641Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8175951Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8176263Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8176594Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8176972Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8177253Z ) 2025-05-07T20:32:24.8177605Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8178045Z def test_silu_mul_quant( 2025-05-07T20:32:24.8178282Z self, 2025-05-07T20:32:24.8178477Z T: int, 2025-05-07T20:32:24.8178677Z D: int, 2025-05-07T20:32:24.8178892Z scale_ub: Optional[float], 2025-05-07T20:32:24.8179166Z contiguous: bool, 2025-05-07T20:32:24.8179416Z compiled: bool, 2025-05-07T20:32:24.8179642Z ) -> None: 2025-05-07T20:32:24.8179853Z torch.manual_seed(2025) 2025-05-07T20:32:24.8180097Z 2025-05-07T20:32:24.8180372Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8180713Z 2025-05-07T20:32:24.8180905Z > x_sign = torch.sign(x) 2025-05-07T20:32:24.8182906Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
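Worth noting: compiled=False examples (for instance the T=1 and T=128 runs above) fail with the same CompilationError as compiled=True ones. That is consistent with the traceback, where the failure originates in triton/runtime/jit.py: _fbgemm_silu_mul_quant is a Triton JIT kernel compiled at first launch regardless of torch.compile, so the error reflects the device, not the Dynamo path.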
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.8184811Z 2025-05-07T20:32:24.8184937Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:24.8185148Z 2025-05-07T20:32:24.8185251Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8185665Z self=, 2025-05-07T20:32:24.8186072Z T=16384, 2025-05-07T20:32:24.8186271Z D=5120, 2025-05-07T20:32:24.8186459Z scale_ub=None, 2025-05-07T20:32:24.8186675Z contiguous=True, 2025-05-07T20:32:24.8186900Z compiled=False, 2025-05-07T20:32:24.8187102Z ) 2025-05-07T20:32:24.8885300Z self = 2025-05-07T20:32:24.8885827Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:24.8886147Z 2025-05-07T20:32:24.8886243Z @given( 2025-05-07T20:32:24.8886496Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8886960Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8887409Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8887824Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8888157Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8888457Z ) 2025-05-07T20:32:24.8888813Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8889256Z def test_silu_mul_quant( 2025-05-07T20:32:24.8889508Z self, 2025-05-07T20:32:24.8889711Z T: int, 2025-05-07T20:32:24.8889919Z D: int, 2025-05-07T20:32:24.8890143Z scale_ub: Optional[float], 2025-05-07T20:32:24.8890418Z contiguous: bool, 2025-05-07T20:32:24.8890661Z compiled: bool, 2025-05-07T20:32:24.8890894Z ) -> None: 2025-05-07T20:32:24.8891112Z torch.manual_seed(2025) 2025-05-07T20:32:24.8891353Z 2025-05-07T20:32:24.8891635Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8893769Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.8895701Z 2025-05-07T20:32:24.8895826Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.8896040Z 2025-05-07T20:32:24.8896152Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8896567Z self=, 2025-05-07T20:32:24.8896976Z T=4096, 2025-05-07T20:32:24.8897179Z D=5120, 2025-05-07T20:32:24.8897376Z scale_ub=None, 2025-05-07T20:32:24.8897597Z contiguous=True, 2025-05-07T20:32:24.8897837Z compiled=False, 2025-05-07T20:32:24.8898048Z ) 2025-05-07T20:32:24.8898375Z self = 2025-05-07T20:32:24.8898876Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:24.8899145Z 2025-05-07T20:32:24.8899232Z @given( 2025-05-07T20:32:24.8899464Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8899782Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8900169Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8900499Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8900831Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8901128Z ) 2025-05-07T20:32:24.8901478Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8901927Z def test_silu_mul_quant( 2025-05-07T20:32:24.8902174Z self, 2025-05-07T20:32:24.8902383Z T: int, 2025-05-07T20:32:24.8902592Z D: int, 2025-05-07T20:32:24.8902826Z scale_ub: Optional[float], 2025-05-07T20:32:24.8903115Z contiguous: bool, 2025-05-07T20:32:24.8903364Z compiled: bool, 2025-05-07T20:32:24.8903596Z ) -> None: 2025-05-07T20:32:24.8903821Z torch.manual_seed(2025) 2025-05-07T20:32:24.8904069Z 2025-05-07T20:32:24.8904349Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8906640Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
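On a device this close to full, the property-based test could also skip example sizes that cannot fit rather than erroring out. A sketch using Hypothesis's assume() together with torch.cuda.mem_get_info(); the 4x headroom factor is a guess meant to cover x plus the handful of same-sized temporaries (abs, clamp, product) the test creates:

    import torch
    from hypothesis import assume

    # Inside test_silu_mul_quant, before the first allocation:
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    needed_bytes = T * (2 * D) * 2  # bf16 input x is [T, 2*D], 2 bytes per element
    assume(4 * needed_bytes <= free_bytes)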
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.8908505Z 2025-05-07T20:32:24.8908633Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.8908848Z 2025-05-07T20:32:24.8908959Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8909380Z self=, 2025-05-07T20:32:24.8909793Z T=2048, 2025-05-07T20:32:24.8909988Z D=5120, 2025-05-07T20:32:24.8910184Z scale_ub=None, 2025-05-07T20:32:24.8910417Z contiguous=False, 2025-05-07T20:32:24.8910651Z compiled=False, 2025-05-07T20:32:24.8910859Z ) 2025-05-07T20:32:24.8911206Z self = 2025-05-07T20:32:24.8911704Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:24.8911977Z 2025-05-07T20:32:24.8912062Z @given( 2025-05-07T20:32:24.8912303Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8912628Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8913005Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8913342Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8913676Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8913971Z ) 2025-05-07T20:32:24.8914325Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8914776Z def test_silu_mul_quant( 2025-05-07T20:32:24.8915094Z self, 2025-05-07T20:32:24.8915303Z T: int, 2025-05-07T20:32:24.8915504Z D: int, 2025-05-07T20:32:24.8915728Z scale_ub: Optional[float], 2025-05-07T20:32:24.8916010Z contiguous: bool, 2025-05-07T20:32:24.8916257Z compiled: bool, 2025-05-07T20:32:24.8916487Z ) -> None: 2025-05-07T20:32:24.8916712Z torch.manual_seed(2025) 2025-05-07T20:32:24.8916954Z 2025-05-07T20:32:24.8917233Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8919310Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.8921245Z 2025-05-07T20:32:24.8921368Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.8921589Z 2025-05-07T20:32:24.8921695Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8922119Z self=, 2025-05-07T20:32:24.8922525Z T=4096, 2025-05-07T20:32:24.8922718Z D=7168, 2025-05-07T20:32:24.8922918Z scale_ub=None, 2025-05-07T20:32:24.8923142Z contiguous=True, 2025-05-07T20:32:24.8923371Z compiled=True, 2025-05-07T20:32:24.8923582Z ) 2025-05-07T20:32:24.8923891Z self = 2025-05-07T20:32:24.8924379Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:24.8924645Z 2025-05-07T20:32:24.8924723Z @given( 2025-05-07T20:32:24.8924951Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8925262Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8925566Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8925892Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8926217Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8926506Z ) 2025-05-07T20:32:24.8926854Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8927289Z def test_silu_mul_quant( 2025-05-07T20:32:24.8927656Z self, 2025-05-07T20:32:24.8927855Z T: int, 2025-05-07T20:32:24.8928042Z D: int, 2025-05-07T20:32:24.8928256Z scale_ub: Optional[float], 2025-05-07T20:32:24.8928520Z contiguous: bool, 2025-05-07T20:32:24.8928755Z compiled: bool, 2025-05-07T20:32:24.8928968Z ) -> None: 2025-05-07T20:32:24.8929175Z torch.manual_seed(2025) 2025-05-07T20:32:24.8929408Z 2025-05-07T20:32:24.8929667Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8931765Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.8933628Z 2025-05-07T20:32:24.8933748Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.8933962Z 2025-05-07T20:32:24.8934073Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8934486Z self=, 2025-05-07T20:32:24.8934890Z T=2048, 2025-05-07T20:32:24.8935126Z D=5120, 2025-05-07T20:32:24.8935323Z scale_ub=1200.0, 2025-05-07T20:32:24.8935549Z contiguous=False, 2025-05-07T20:32:24.8935773Z compiled=False, 2025-05-07T20:32:24.8935980Z ) 2025-05-07T20:32:24.8936285Z self = 2025-05-07T20:32:24.8936782Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:24.8937054Z 2025-05-07T20:32:24.8937134Z @given( 2025-05-07T20:32:24.8937364Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8937675Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8937981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8938302Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8938634Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8938914Z ) 2025-05-07T20:32:24.8939264Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8939771Z def test_silu_mul_quant( 2025-05-07T20:32:24.8940020Z self, 2025-05-07T20:32:24.8940209Z T: int, 2025-05-07T20:32:24.8940404Z D: int, 2025-05-07T20:32:24.8940620Z scale_ub: Optional[float], 2025-05-07T20:32:24.8940893Z contiguous: bool, 2025-05-07T20:32:24.8941120Z compiled: bool, 2025-05-07T20:32:24.8941343Z ) -> None: 2025-05-07T20:32:24.8941556Z torch.manual_seed(2025) 2025-05-07T20:32:24.8941798Z 2025-05-07T20:32:24.8942074Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8944118Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.8945979Z 2025-05-07T20:32:24.8946095Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.8946308Z 2025-05-07T20:32:24.8946421Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8946824Z self=, 2025-05-07T20:32:24.8947228Z T=4096, 2025-05-07T20:32:24.8947459Z D=7168, 2025-05-07T20:32:24.8947654Z scale_ub=1200.0, 2025-05-07T20:32:24.8947882Z contiguous=True, 2025-05-07T20:32:24.8948104Z compiled=False, 2025-05-07T20:32:24.8948312Z ) 2025-05-07T20:32:24.9864108Z self = 2025-05-07T20:32:24.9865079Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:24.9865597Z 2025-05-07T20:32:24.9865765Z @given( 2025-05-07T20:32:24.9866188Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.9866754Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.9867319Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.9867927Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.9868527Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.9869042Z ) 2025-05-07T20:32:24.9869690Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.9870676Z def test_silu_mul_quant( 2025-05-07T20:32:24.9871118Z self, 2025-05-07T20:32:24.9871476Z T: int, 2025-05-07T20:32:24.9871842Z D: int, 2025-05-07T20:32:24.9872235Z scale_ub: Optional[float], 2025-05-07T20:32:24.9872678Z contiguous: bool, 2025-05-07T20:32:24.9872930Z compiled: bool, 2025-05-07T20:32:24.9873155Z ) -> None: 2025-05-07T20:32:24.9873374Z torch.manual_seed(2025) 2025-05-07T20:32:24.9873690Z 2025-05-07T20:32:24.9873960Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.9876033Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.9877902Z 2025-05-07T20:32:24.9878026Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.9878244Z 2025-05-07T20:32:24.9878347Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.9878765Z self=, 2025-05-07T20:32:24.9879239Z T=16384, 2025-05-07T20:32:24.9879439Z D=7168, 2025-05-07T20:32:24.9879636Z scale_ub=None, 2025-05-07T20:32:24.9879851Z contiguous=False, 2025-05-07T20:32:24.9880077Z compiled=True, 2025-05-07T20:32:24.9880281Z ) 2025-05-07T20:32:24.9880595Z self = 2025-05-07T20:32:24.9881091Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:24.9881372Z 2025-05-07T20:32:24.9881456Z @given( 2025-05-07T20:32:24.9881692Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.9881999Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.9882306Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.9882638Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.9882964Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.9883253Z ) 2025-05-07T20:32:24.9883608Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.9884048Z def test_silu_mul_quant( 2025-05-07T20:32:24.9884320Z self, 2025-05-07T20:32:24.9884514Z T: int, 2025-05-07T20:32:24.9884716Z D: int, 2025-05-07T20:32:24.9884932Z scale_ub: Optional[float], 2025-05-07T20:32:24.9885206Z contiguous: bool, 2025-05-07T20:32:24.9885448Z compiled: bool, 2025-05-07T20:32:24.9885669Z ) -> None: 2025-05-07T20:32:24.9885959Z torch.manual_seed(2025) 2025-05-07T20:32:24.9886205Z 2025-05-07T20:32:24.9886479Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.9888597Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.9890462Z 2025-05-07T20:32:24.9890581Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.9890795Z 2025-05-07T20:32:24.9890902Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.9891318Z self=, 2025-05-07T20:32:24.9891766Z T=4096, 2025-05-07T20:32:24.9891954Z D=7168, 2025-05-07T20:32:24.9892146Z scale_ub=None, 2025-05-07T20:32:24.9892360Z contiguous=True, 2025-05-07T20:32:24.9892581Z compiled=False, 2025-05-07T20:32:24.9892797Z ) 2025-05-07T20:32:24.9893123Z self = 2025-05-07T20:32:24.9893618Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:24.9893941Z 2025-05-07T20:32:24.9894023Z @given( 2025-05-07T20:32:24.9894253Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.9894562Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.9894865Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.9895194Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.9895528Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.9895811Z ) 2025-05-07T20:32:24.9896163Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.9896609Z def test_silu_mul_quant( 2025-05-07T20:32:24.9896849Z self, 2025-05-07T20:32:24.9897046Z T: int, 2025-05-07T20:32:24.9897245Z D: int, 2025-05-07T20:32:24.9897455Z scale_ub: Optional[float], 2025-05-07T20:32:24.9897726Z contiguous: bool, 2025-05-07T20:32:24.9897963Z compiled: bool, 2025-05-07T20:32:24.9898234Z ) -> None: 2025-05-07T20:32:24.9898452Z torch.manual_seed(2025) 2025-05-07T20:32:24.9898691Z 2025-05-07T20:32:24.9898960Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.9900998Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.9902906Z 2025-05-07T20:32:24.9903027Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.9903243Z 2025-05-07T20:32:24.9903351Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.9903765Z self=, 2025-05-07T20:32:24.9904162Z T=16384, 2025-05-07T20:32:24.9904362Z D=7168, 2025-05-07T20:32:24.9904559Z scale_ub=None, 2025-05-07T20:32:24.9904766Z contiguous=True, 2025-05-07T20:32:24.9905001Z compiled=False, 2025-05-07T20:32:24.9905212Z ) 2025-05-07T20:32:24.9905525Z self = 2025-05-07T20:32:24.9906257Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:24.9906544Z 2025-05-07T20:32:24.9906625Z @given( 2025-05-07T20:32:24.9906857Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.9907166Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.9907470Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.9907799Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.9908126Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.9908424Z ) 2025-05-07T20:32:24.9908778Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.9909215Z def test_silu_mul_quant( 2025-05-07T20:32:24.9909457Z self, 2025-05-07T20:32:24.9909653Z T: int, 2025-05-07T20:32:24.9909843Z D: int, 2025-05-07T20:32:24.9910064Z scale_ub: Optional[float], 2025-05-07T20:32:24.9910339Z contiguous: bool, 2025-05-07T20:32:24.9910581Z compiled: bool, 2025-05-07T20:32:24.9910802Z ) -> None: 2025-05-07T20:32:24.9911086Z torch.manual_seed(2025) 2025-05-07T20:32:24.9911329Z 2025-05-07T20:32:24.9911595Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.9913640Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
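The "Tried to allocate" figures line up exactly with the test's first tensor, x = torch.randn([T, 2 * D], dtype=torch.bfloat16), at 2 bytes per element — a quick check against the sizes reported above:

def bf16_input_mib(T: int, D: int) -> float:
    # [T, 2 * D] bfloat16 elements, 2 bytes each, reported in MiB.
    return T * (2 * D) * 2 / 2**20

assert bf16_input_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"
assert bf16_input_mib(4096, 7168) == 112.0   # "Tried to allocate 112.00 MiB"
assert bf16_input_mib(2048, 7168) == 56.0    # "Tried to allocate 56.00 MiB"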
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.9915553Z 2025-05-07T20:32:24.9915672Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.9915882Z 2025-05-07T20:32:24.9915990Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.9916407Z self=, 2025-05-07T20:32:24.9916804Z T=16384, 2025-05-07T20:32:24.9917000Z D=7168, 2025-05-07T20:32:24.9917199Z scale_ub=1200.0, 2025-05-07T20:32:24.9917420Z contiguous=True, 2025-05-07T20:32:24.9917645Z compiled=False, 2025-05-07T20:32:24.9917850Z ) 2025-05-07T20:32:24.9918165Z self = 2025-05-07T20:32:24.9918731Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:24.9919007Z 2025-05-07T20:32:24.9919090Z @given( 2025-05-07T20:32:24.9919315Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.9919630Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.9919935Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.9920262Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.9920589Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.9920880Z ) 2025-05-07T20:32:24.9921229Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.9921666Z def test_silu_mul_quant( 2025-05-07T20:32:24.9921908Z self, 2025-05-07T20:32:24.9922108Z T: int, 2025-05-07T20:32:24.9922299Z D: int, 2025-05-07T20:32:24.9922524Z scale_ub: Optional[float], 2025-05-07T20:32:24.9922794Z contiguous: bool, 2025-05-07T20:32:24.9923036Z compiled: bool, 2025-05-07T20:32:24.9923259Z ) -> None: 2025-05-07T20:32:24.9923479Z torch.manual_seed(2025) 2025-05-07T20:32:24.9923722Z 2025-05-07T20:32:24.9923998Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.9926094Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.9928006Z 2025-05-07T20:32:24.9928126Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.9928341Z 2025-05-07T20:32:24.9928449Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.9928860Z self=, 2025-05-07T20:32:24.9929260Z T=128, 2025-05-07T20:32:24.9929450Z D=5120, 2025-05-07T20:32:24.9929641Z scale_ub=1200.0, 2025-05-07T20:32:24.9929869Z contiguous=False, 2025-05-07T20:32:24.9930097Z compiled=False, 2025-05-07T20:32:24.9930305Z ) 2025-05-07T20:32:25.0940809Z self = 2025-05-07T20:32:25.0942486Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:25.0943043Z 2025-05-07T20:32:25.0943171Z @given( 2025-05-07T20:32:25.0943534Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.0944038Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.0944529Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.0957977Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.0958719Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.0959135Z ) 2025-05-07T20:32:25.0959648Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.0960306Z def test_silu_mul_quant( 2025-05-07T20:32:25.0960661Z self, 2025-05-07T20:32:25.0960957Z T: int, 2025-05-07T20:32:25.0961254Z D: int, 2025-05-07T20:32:25.0961586Z scale_ub: Optional[float], 2025-05-07T20:32:25.0962032Z contiguous: bool, 2025-05-07T20:32:25.0962419Z compiled: bool, 2025-05-07T20:32:25.0962792Z ) -> None: 2025-05-07T20:32:25.0963106Z torch.manual_seed(2025) 2025-05-07T20:32:25.0963472Z 2025-05-07T20:32:25.0963872Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.0964445Z 2025-05-07T20:32:25.0964751Z x_sign = torch.sign(x) 2025-05-07T20:32:25.0965223Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.0965926Z x = x_sign * x_clamp 2025-05-07T20:32:25.0966339Z x0 = x[:, :D] 2025-05-07T20:32:25.0966694Z x1 = x[:, D:] 2025-05-07T20:32:25.0967055Z 2025-05-07T20:32:25.0967371Z if contiguous: 2025-05-07T20:32:25.0967906Z x0 = x0.contiguous() 2025-05-07T20:32:25.0968358Z x1 = x1.contiguous() 2025-05-07T20:32:25.0968777Z 2025-05-07T20:32:25.0969090Z if scale_ub is not None: 2025-05-07T20:32:25.0969567Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.0970158Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.0970696Z ) 2025-05-07T20:32:25.0971012Z else: 2025-05-07T20:32:25.0971371Z scale_ub_tensor = None 2025-05-07T20:32:25.0971782Z 2025-05-07T20:32:25.0972131Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.0972651Z op = silu_mul_quant 2025-05-07T20:32:25.0973081Z if compiled: 2025-05-07T20:32:25.0973469Z op = torch.compile(op) 2025-05-07T20:32:25.0973942Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.0974399Z 2025-05-07T20:32:25.0974702Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.0975000Z 2025-05-07T20:32:25.0975171Z moe/activation_test.py:117: 2025-05-07T20:32:25.0975680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.0976242Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.0976863Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.0978112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.0979355Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.0980286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.0981526Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.0982722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.0983680Z kernel = self.compile( 2025-05-07T20:32:25.0984581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.0985656Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.0986333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.0986830Z 2025-05-07T20:32:25.0987164Z self = 2025-05-07T20:32:25.0988932Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.0991343Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f005df3e700>} 2025-05-07T20:32:25.0993885Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.0995719Z context = 2025-05-07T20:32:25.0996202Z 2025-05-07T20:32:25.0996491Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.0997414Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.0998238Z module_map=module_map) 2025-05-07T20:32:25.0998843Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.0999440Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.0999881Z E ^ 2025-05-07T20:32:25.1000774Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.1001543Z 2025-05-07T20:32:25.1002262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.1003175Z 2025-05-07T20:32:25.1003345Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1004048Z self=, 2025-05-07T20:32:25.1004758Z T=2048, 2025-05-07T20:32:25.1005077Z D=7168, 2025-05-07T20:32:25.1005402Z scale_ub=None, 2025-05-07T20:32:25.1006367Z contiguous=False, 2025-05-07T20:32:25.1006750Z compiled=False, 2025-05-07T20:32:25.1007102Z ) 2025-05-07T20:32:25.1007711Z self = 2025-05-07T20:32:25.1008528Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:25.1009045Z 2025-05-07T20:32:25.1009170Z @given( 2025-05-07T20:32:25.1009526Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1009979Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1010438Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1010990Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1011569Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1012056Z ) 2025-05-07T20:32:25.1012817Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1013608Z def test_silu_mul_quant( 2025-05-07T20:32:25.1014011Z self, 2025-05-07T20:32:25.1014332Z T: int, 2025-05-07T20:32:25.1014657Z D: int, 2025-05-07T20:32:25.1015008Z scale_ub: Optional[float], 2025-05-07T20:32:25.1015468Z contiguous: bool, 2025-05-07T20:32:25.1015877Z compiled: bool, 2025-05-07T20:32:25.1016238Z ) -> None: 2025-05-07T20:32:25.1016605Z torch.manual_seed(2025) 2025-05-07T20:32:25.1017016Z 2025-05-07T20:32:25.1017466Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1021313Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
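Two failure modes alternate in this log: the CUDA OOMs and the Triton CompilationError ("type fp8e4nv not supported in this architecture"). For the latter, a hedged guard that a test could use to skip FP8 paths on unsupported devices — the sm_89 threshold is an assumption inferred from the error text, not FBGEMM's own gating:

import torch

def supports_triton_fp8e4nv() -> bool:
    # Assumption: Triton's fp8e4nv (e4m3) needs compute capability >= 8.9
    # (Ada/Hopper); on older parts only fp8e4b15/fp8e5 compile, per the error.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

A test might then wrap the quantized paths in unittest.skipIf(not supports_triton_fp8e4nv(), "fp8e4nv unsupported").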
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.1024709Z 2025-05-07T20:32:25.1024904Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.1025268Z 2025-05-07T20:32:25.1025433Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1026122Z self=, 2025-05-07T20:32:25.1026888Z T=128, 2025-05-07T20:32:25.1027194Z D=7168, 2025-05-07T20:32:25.1027516Z scale_ub=1200.0, 2025-05-07T20:32:25.1027879Z contiguous=True, 2025-05-07T20:32:25.1028259Z compiled=True, 2025-05-07T20:32:25.1028605Z ) 2025-05-07T20:32:25.1320578Z self = 2025-05-07T20:32:25.1321469Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.1321937Z 2025-05-07T20:32:25.1322077Z @given( 2025-05-07T20:32:25.1322435Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1322943Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1323444Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1323987Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1324526Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1325002Z ) 2025-05-07T20:32:25.1325783Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1326528Z def test_silu_mul_quant( 2025-05-07T20:32:25.1326889Z self, 2025-05-07T20:32:25.1327159Z T: int, 2025-05-07T20:32:25.1327456Z D: int, 2025-05-07T20:32:25.1327910Z scale_ub: Optional[float], 2025-05-07T20:32:25.1328356Z contiguous: bool, 2025-05-07T20:32:25.1328759Z compiled: bool, 2025-05-07T20:32:25.1329122Z ) -> None: 2025-05-07T20:32:25.1329503Z torch.manual_seed(2025) 2025-05-07T20:32:25.1329923Z 2025-05-07T20:32:25.1330333Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1330860Z 2025-05-07T20:32:25.1331143Z x_sign = torch.sign(x) 2025-05-07T20:32:25.1331558Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.1332033Z x = x_sign * x_clamp 2025-05-07T20:32:25.1332428Z x0 = x[:, :D] 2025-05-07T20:32:25.1332784Z x1 = x[:, D:] 2025-05-07T20:32:25.1333084Z 2025-05-07T20:32:25.1333356Z if contiguous: 2025-05-07T20:32:25.1333707Z x0 = x0.contiguous() 2025-05-07T20:32:25.1334127Z x1 = x1.contiguous() 2025-05-07T20:32:25.1334515Z 2025-05-07T20:32:25.1334828Z if scale_ub is not None: 2025-05-07T20:32:25.1335211Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.1335680Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.1336257Z ) 2025-05-07T20:32:25.1336534Z else: 2025-05-07T20:32:25.1336831Z scale_ub_tensor = None 2025-05-07T20:32:25.1337193Z 2025-05-07T20:32:25.1337518Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1337982Z op = silu_mul_quant 2025-05-07T20:32:25.1338363Z if compiled: 2025-05-07T20:32:25.1338719Z op = torch.compile(op) 2025-05-07T20:32:25.1339171Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1339613Z 2025-05-07T20:32:25.1339903Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.1340147Z 2025-05-07T20:32:25.1340295Z moe/activation_test.py:117: 2025-05-07T20:32:25.1340731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1341216Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.1341613Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1342480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.1343441Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.1344427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.1345463Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.1346271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.1347409Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.1348408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.1349221Z kernel = self.compile( 2025-05-07T20:32:25.1350039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.1351037Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.1351633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1351989Z 2025-05-07T20:32:25.1352283Z self = 2025-05-07T20:32:25.1353951Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.1356199Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f005df3ff60>} 2025-05-07T20:32:25.1358300Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.1359835Z context = 2025-05-07T20:32:25.1360262Z 2025-05-07T20:32:25.1360505Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.1361302Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.1362030Z module_map=module_map) 2025-05-07T20:32:25.1362552Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.1363052Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.1363481Z E ^ 2025-05-07T20:32:25.1364155Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.1364869Z 2025-05-07T20:32:25.1365524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.1366338Z 2025-05-07T20:32:25.1366492Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1367180Z self=, 2025-05-07T20:32:25.1367915Z T=128, 2025-05-07T20:32:25.1368208Z D=7168, 2025-05-07T20:32:25.1368514Z scale_ub=1200.0, 2025-05-07T20:32:25.1368857Z contiguous=True, 2025-05-07T20:32:25.1369214Z compiled=False, 2025-05-07T20:32:25.1369529Z ) 2025-05-07T20:32:25.1369973Z self = 2025-05-07T20:32:25.1370707Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.1371131Z 2025-05-07T20:32:25.1371264Z @given( 2025-05-07T20:32:25.1371620Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1372127Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1372632Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1373106Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1373602Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1374042Z ) 2025-05-07T20:32:25.1374653Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1375360Z def test_silu_mul_quant( 2025-05-07T20:32:25.1375707Z self, 2025-05-07T20:32:25.1375983Z T: int, 2025-05-07T20:32:25.1376263Z D: int, 2025-05-07T20:32:25.1376584Z scale_ub: Optional[float], 2025-05-07T20:32:25.1376994Z contiguous: bool, 2025-05-07T20:32:25.1377368Z compiled: bool, 2025-05-07T20:32:25.1377801Z ) -> None: 2025-05-07T20:32:25.1378163Z torch.manual_seed(2025) 2025-05-07T20:32:25.1378529Z 2025-05-07T20:32:25.1378932Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1379429Z 2025-05-07T20:32:25.1379705Z x_sign = torch.sign(x) 2025-05-07T20:32:25.1380140Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.1383345Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.1386291Z 2025-05-07T20:32:25.1386477Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:25.1386802Z 2025-05-07T20:32:25.1386965Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1387558Z self=, 2025-05-07T20:32:25.1388144Z T=128, 2025-05-07T20:32:25.1388412Z D=5120, 2025-05-07T20:32:25.1388685Z scale_ub=1200.0, 2025-05-07T20:32:25.1389001Z contiguous=True, 2025-05-07T20:32:25.1389321Z compiled=True, 2025-05-07T20:32:25.1389612Z ) 2025-05-07T20:32:25.1390068Z self = 2025-05-07T20:32:25.1390762Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.1391146Z 2025-05-07T20:32:25.1391268Z @given( 2025-05-07T20:32:25.1391596Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1392059Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1392522Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1393006Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1393499Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1393916Z ) 2025-05-07T20:32:25.1394415Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1395043Z def test_silu_mul_quant( 2025-05-07T20:32:25.1395383Z self, 2025-05-07T20:32:25.1395650Z T: int, 2025-05-07T20:32:25.1395997Z D: int, 2025-05-07T20:32:25.1396307Z scale_ub: Optional[float], 2025-05-07T20:32:25.1396692Z contiguous: bool, 2025-05-07T20:32:25.1397023Z compiled: bool, 2025-05-07T20:32:25.1397338Z ) -> None: 2025-05-07T20:32:25.1397644Z torch.manual_seed(2025) 2025-05-07T20:32:25.1397984Z 2025-05-07T20:32:25.1398367Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1398860Z 2025-05-07T20:32:25.1399136Z x_sign = torch.sign(x) 2025-05-07T20:32:25.1399545Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.1402526Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.1405253Z 2025-05-07T20:32:25.1405437Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:25.1406287Z 2025-05-07T20:32:25.1406451Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1407051Z self=, 2025-05-07T20:32:25.1407870Z T=128, 2025-05-07T20:32:25.1408142Z D=7168, 2025-05-07T20:32:25.1408418Z scale_ub=None, 2025-05-07T20:32:25.1408732Z contiguous=True, 2025-05-07T20:32:25.1409058Z compiled=True, 2025-05-07T20:32:25.1409349Z ) 2025-05-07T20:32:25.3348348Z self = 2025-05-07T20:32:25.3349410Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.3349952Z 2025-05-07T20:32:25.3350115Z @given( 2025-05-07T20:32:25.3350610Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3351241Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.3351858Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.3352444Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.3352822Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.3353108Z ) 2025-05-07T20:32:25.3353457Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.3354067Z def test_silu_mul_quant( 2025-05-07T20:32:25.3354309Z self, 2025-05-07T20:32:25.3354503Z T: int, 2025-05-07T20:32:25.3354704Z D: int, 2025-05-07T20:32:25.3354923Z scale_ub: Optional[float], 2025-05-07T20:32:25.3355191Z contiguous: bool, 2025-05-07T20:32:25.3355433Z compiled: bool, 2025-05-07T20:32:25.3355661Z ) -> None: 2025-05-07T20:32:25.3355875Z torch.manual_seed(2025) 2025-05-07T20:32:25.3356121Z 2025-05-07T20:32:25.3356397Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3358458Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
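Note the free-memory figure shrinking from 26.44 MiB to 4.44 MiB as Hypothesis replays examples: tensors from earlier iterations of the test body are evidently still alive. A hedged cleanup sketch — explicit release at the end of the body, since tearDown runs once per test method, not once per Hypothesis example:

import gc
import torch

def allocate_and_release(T: int, D: int) -> int:
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    del x                      # drop the only reference to the buffer
    gc.collect()               # collect anything still holding CUDA storage
    torch.cuda.empty_cache()   # return cached blocks to the driver
    return torch.cuda.memory_reserved()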
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.3360331Z 2025-05-07T20:32:25.3360460Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.3360675Z 2025-05-07T20:32:25.3367693Z FAILED 2025-05-07T20:32:25.3367833Z 2025-05-07T20:32:25.3368103Z =================================== FAILURES =================================== 2025-05-07T20:32:25.3368564Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:25.3369202Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:25.3370127Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 57, in testPartExecutor 2025-05-07T20:32:25.3370922Z | yield 2025-05-07T20:32:25.3371510Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 623, in run 2025-05-07T20:32:25.3372228Z | self._callTestMethod(testMethod) 2025-05-07T20:32:25.3372978Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 579, in _callTestMethod 2025-05-07T20:32:25.3373730Z | if method() is not None: 2025-05-07T20:32:25.3374082Z | ^^^^^^^^ 2025-05-07T20:32:25.3374940Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:25.3376026Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3376438Z | ^^^^^^^ 2025-05-07T20:32:25.3377194Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:25.3378032Z | raise the_error_hypothesis_found 2025-05-07T20:32:25.3378631Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:25.3379286Z +-+---------------- 1 ---------------- 2025-05-07T20:32:25.3379675Z | Traceback (most recent call last): 2025-05-07T20:32:25.3380635Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:25.3381685Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3382192Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.3384975Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.3387770Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:25.3388362Z | self=, 2025-05-07T20:32:25.3388907Z | T=2048, 2025-05-07T20:32:25.3389222Z | D=5120, # or any other generated value 2025-05-07T20:32:25.3389681Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:25.3390183Z | contiguous=True, # or any other generated value 2025-05-07T20:32:25.3390668Z | compiled=False, # or any other generated value 2025-05-07T20:32:25.3391084Z | ) 2025-05-07T20:32:25.3391341Z | 2025-05-07T20:32:25.3392044Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:25.3392905Z +---------------- 2 ---------------- 2025-05-07T20:32:25.3393328Z | Traceback (most recent call last): 2025-05-07T20:32:25.3394285Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:25.3395333Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3395841Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.3398610Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.3401320Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:25.3401918Z | self=, 2025-05-07T20:32:25.3402486Z | T=128, 2025-05-07T20:32:25.3402773Z | D=7168, 2025-05-07T20:32:25.3403079Z | scale_ub=None, 2025-05-07T20:32:25.3403393Z | contiguous=True, 2025-05-07T20:32:25.3403725Z | compiled=True, 2025-05-07T20:32:25.3422148Z | ) 2025-05-07T20:32:25.3422455Z | 2025-05-07T20:32:25.3423371Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:25.3424221Z +---------------- 3 ---------------- 2025-05-07T20:32:25.3424629Z | Traceback (most recent call last): 2025-05-07T20:32:25.3425631Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:25.3426831Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3427374Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.3430173Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
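The reproduce_failure note above translates to a temporary decorator on the test case; a sketch using the blob from failure 1 — the decorator only replays correctly against the unmodified test, so the strategies must stay exactly as shown in this log:

from typing import Optional

from hypothesis import given, reproduce_failure, settings, strategies as st

@reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")  # temporary, per the report
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@settings(deadline=None)
def test_silu_mul_quant_repro(
    T: int, D: int, scale_ub: Optional[float], contiguous: bool, compiled: bool
) -> None:
    ...  # original test body goes here, unchanged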
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.3432944Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:25.3433547Z | self=, 2025-05-07T20:32:25.3434184Z | T=128, 2025-05-07T20:32:25.3434468Z | D=5120, 2025-05-07T20:32:25.3434775Z | scale_ub=1200.0, 2025-05-07T20:32:25.3435104Z | contiguous=True, 2025-05-07T20:32:25.3435445Z | compiled=True, 2025-05-07T20:32:25.3435764Z | ) 2025-05-07T20:32:25.3436011Z | 2025-05-07T20:32:25.3436742Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:25.3437585Z +---------------- 4 ---------------- 2025-05-07T20:32:25.3437986Z | Traceback (most recent call last): 2025-05-07T20:32:25.3438963Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:25.3439956Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:25.3440359Z | ^^^^^^^^ 2025-05-07T20:32:25.3441257Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:25.3442206Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.3442677Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.3443787Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:25.3444976Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:25.3445768Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:25.3446774Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.3447376Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.3448408Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:25.3449468Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.3450124Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.3451049Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:25.3452222Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.3452867Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.3453772Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:25.3454759Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:25.3455329Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.3456144Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:25.3456930Z | fn() 2025-05-07T20:32:25.3457722Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:25.3458590Z | self.fn.run( 2025-05-07T20:32:25.3459325Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:25.3460130Z | kernel = self.compile( 2025-05-07T20:32:25.3460493Z | ^^^^^^^^^^^^^ 2025-05-07T20:32:25.3461306Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:25.3462364Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.3462935Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.3463821Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:25.3464907Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.3465580Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.3466109Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.3466588Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:25.3466955Z | ^ 2025-05-07T20:32:25.3467598Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.3468365Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:25.3468893Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:25.3469605Z | self=, 2025-05-07T20:32:25.3470198Z | T=1, # or any other generated value 2025-05-07T20:32:25.3470601Z | D=5120, # or any other generated value 2025-05-07T20:32:25.3471060Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:25.3471618Z | contiguous=True, # or any other generated value 2025-05-07T20:32:25.3472121Z | compiled=True, # or any other generated value 2025-05-07T20:32:25.3472552Z | ) 2025-05-07T20:32:25.3472811Z | 2025-05-07T20:32:25.3473520Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:25.3474346Z +------------------------------------ 2025-05-07T20:32:25.3474819Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:25.3475340Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.3475915Z self=, 2025-05-07T20:32:25.3476476Z T=1, 2025-05-07T20:32:25.3476735Z D=5120, 2025-05-07T20:32:25.3476991Z scale_ub=None, 2025-05-07T20:32:25.3477277Z contiguous=True, 2025-05-07T20:32:25.3477584Z compiled=True, 2025-05-07T20:32:25.3477861Z ) 2025-05-07T20:32:25.3478312Z self = 2025-05-07T20:32:25.3479037Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.3479367Z 2025-05-07T20:32:25.3479482Z @given( 2025-05-07T20:32:25.3479788Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3480210Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.3480626Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.3481105Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.3481549Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.3481932Z ) 2025-05-07T20:32:25.3482397Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.3482987Z def test_silu_mul_quant( 2025-05-07T20:32:25.3483315Z self, 2025-05-07T20:32:25.3483574Z T: int, 2025-05-07T20:32:25.3483849Z D: int, 2025-05-07T20:32:25.3484145Z scale_ub: Optional[float], 2025-05-07T20:32:25.3484512Z contiguous: 
bool, 2025-05-07T20:32:25.3484827Z compiled: bool, 2025-05-07T20:32:25.3485127Z ) -> None: 2025-05-07T20:32:25.3485415Z torch.manual_seed(2025) 2025-05-07T20:32:25.3485734Z 2025-05-07T20:32:25.3486098Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3486561Z 2025-05-07T20:32:25.3486814Z x_sign = torch.sign(x) 2025-05-07T20:32:25.3487205Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.3487792Z x = x_sign * x_clamp 2025-05-07T20:32:25.3488114Z x0 = x[:, :D] 2025-05-07T20:32:25.3488418Z x1 = x[:, D:] 2025-05-07T20:32:25.3488713Z 2025-05-07T20:32:25.3488972Z if contiguous: 2025-05-07T20:32:25.3489297Z x0 = x0.contiguous() 2025-05-07T20:32:25.3489657Z x1 = x1.contiguous() 2025-05-07T20:32:25.3489983Z 2025-05-07T20:32:25.3490250Z if scale_ub is not None: 2025-05-07T20:32:25.3490637Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.3491086Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.3491496Z ) 2025-05-07T20:32:25.3491763Z else: 2025-05-07T20:32:25.3492060Z scale_ub_tensor = None 2025-05-07T20:32:25.3492416Z 2025-05-07T20:32:25.3492746Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3493192Z op = silu_mul_quant 2025-05-07T20:32:25.3493543Z if compiled: 2025-05-07T20:32:25.3493893Z op = torch.compile(op) 2025-05-07T20:32:25.3494298Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3494678Z 2025-05-07T20:32:25.3494928Z y_fp8, y_scale = fn() 2025-05-07T20:32:25.3495308Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:25.3495714Z 2025-05-07T20:32:25.3496031Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3496499Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:25.3496962Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:25.3497388Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:25.3497883Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.3498322Z 2025-05-07T20:32:25.3498594Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:25.3498875Z 2025-05-07T20:32:25.3499011Z moe/activation_test.py:126: 2025-05-07T20:32:25.3499403Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3499854Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:25.3500279Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.3501314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:25.3502308Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:25.3503079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.3503969Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.3504868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:25.3506084Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.3507193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:25.3508232Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.3509263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:25.3510186Z return 
self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:25.3511041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:25.3511765Z fn() 2025-05-07T20:32:25.3512506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:25.3513310Z self.fn.run( 2025-05-07T20:32:25.3513929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.3514646Z kernel = self.compile( 2025-05-07T20:32:25.3515546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.3516458Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.3517001Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3517337Z 2025-05-07T20:32:25.3517617Z self = 2025-05-07T20:32:25.3519125Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.3521035Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05c48553a0>} 2025-05-07T20:32:25.3522917Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.3524370Z context = 2025-05-07T20:32:25.3524785Z 2025-05-07T20:32:25.3525016Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.3525832Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.3526488Z module_map=module_map) 2025-05-07T20:32:25.3526993Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.3527494Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:25.3527968Z E ^ 2025-05-07T20:32:25.3528626Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.3529273Z 2025-05-07T20:32:25.3529859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.3530580Z 2025-05-07T20:32:25.3530728Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.3531291Z self=, 2025-05-07T20:32:25.3531858Z T=2048, 2025-05-07T20:32:25.3532121Z D=5120, 2025-05-07T20:32:25.3532404Z scale_ub=1200.0, 2025-05-07T20:32:25.3532738Z contiguous=True, 2025-05-07T20:32:25.3533040Z compiled=False, 2025-05-07T20:32:25.3533391Z ) 2025-05-07T20:32:25.3533814Z self = 2025-05-07T20:32:25.3534466Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.3534826Z 2025-05-07T20:32:25.3534946Z @given( 2025-05-07T20:32:25.3535244Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3535663Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.3536128Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.3536562Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.3537008Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.3537404Z ) 2025-05-07T20:32:25.3537876Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.3538494Z def test_silu_mul_quant( 2025-05-07T20:32:25.3538839Z self, 2025-05-07T20:32:25.3539109Z T: int, 2025-05-07T20:32:25.3539396Z D: int, 2025-05-07T20:32:25.3539702Z scale_ub: Optional[float], 2025-05-07T20:32:25.3540072Z contiguous: bool, 2025-05-07T20:32:25.3540397Z compiled: bool, 2025-05-07T20:32:25.3540715Z ) -> None: 2025-05-07T20:32:25.3541018Z torch.manual_seed(2025) 2025-05-07T20:32:25.3541338Z 2025-05-07T20:32:25.3541718Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3542268Z 2025-05-07T20:32:25.3542536Z x_sign = torch.sign(x) 2025-05-07T20:32:25.3542945Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.3543379Z x = x_sign * x_clamp 2025-05-07T20:32:25.3543714Z x0 = x[:, :D] 2025-05-07T20:32:25.3544023Z x1 = x[:, D:] 2025-05-07T20:32:25.3544322Z 2025-05-07T20:32:25.3544582Z if contiguous: 2025-05-07T20:32:25.3544911Z x0 = x0.contiguous() 2025-05-07T20:32:25.3545279Z x1 = x1.contiguous() 2025-05-07T20:32:25.3545615Z 2025-05-07T20:32:25.3545891Z if scale_ub is not None: 2025-05-07T20:32:25.3546277Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.3546729Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.3547173Z ) 2025-05-07T20:32:25.3547449Z else: 2025-05-07T20:32:25.3547743Z scale_ub_tensor = None 2025-05-07T20:32:25.3548087Z 2025-05-07T20:32:25.3548399Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3548823Z op = silu_mul_quant 2025-05-07T20:32:25.3549983Z if compiled: 2025-05-07T20:32:25.3550332Z op = torch.compile(op) 2025-05-07T20:32:25.3550734Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3551109Z 2025-05-07T20:32:25.3551383Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.3551595Z 2025-05-07T20:32:25.3551737Z moe/activation_test.py:117: 2025-05-07T20:32:25.3552202Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3552706Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.3553090Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3554000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.3554910Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.3555615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.3556566Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.3557472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.3558194Z kernel = self.compile( 2025-05-07T20:32:25.3558929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.3559885Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.3560415Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3560731Z 2025-05-07T20:32:25.3561005Z self = 2025-05-07T20:32:25.3562471Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.3564362Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05c45102c0>} 2025-05-07T20:32:25.3566178Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.3567688Z context = 2025-05-07T20:32:25.3568092Z 2025-05-07T20:32:25.3568315Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.3568907Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.3569373Z module_map=module_map) 2025-05-07T20:32:25.3569735Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.3570158Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.3570424Z E ^ 2025-05-07T20:32:25.3570886Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.3571340Z 2025-05-07T20:32:25.3571755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.3572286Z 2025-05-07T20:32:25.3572406Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.3572844Z self=, 2025-05-07T20:32:25.3573244Z T=2048, 2025-05-07T20:32:25.3573441Z D=5120, 2025-05-07T20:32:25.3573642Z scale_ub=1200.0, 2025-05-07T20:32:25.3573865Z contiguous=True, 2025-05-07T20:32:25.3574091Z compiled=True, 2025-05-07T20:32:25.3574304Z ) 2025-05-07T20:32:25.3574622Z self = 2025-05-07T20:32:25.3575122Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.3575391Z 2025-05-07T20:32:25.3575480Z @given( 2025-05-07T20:32:25.3575712Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3576029Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.3576347Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.3576682Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.3577063Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.3577358Z ) 2025-05-07T20:32:25.3577602Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.3577709Z def test_silu_mul_quant( 2025-05-07T20:32:25.3577791Z self, 2025-05-07T20:32:25.3577871Z T: int, 2025-05-07T20:32:25.3577957Z D: int, 2025-05-07T20:32:25.3578057Z scale_ub: Optional[float], 2025-05-07T20:32:25.3578154Z contiguous: bool, 2025-05-07T20:32:25.3578253Z compiled: bool, 2025-05-07T20:32:25.3578336Z ) -> None: 2025-05-07T20:32:25.3578440Z torch.manual_seed(2025) 2025-05-07T20:32:25.3578516Z 2025-05-07T20:32:25.3578688Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3578772Z 2025-05-07T20:32:25.3578867Z x_sign = torch.sign(x) 2025-05-07T20:32:25.3578994Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.3579095Z x = x_sign * x_clamp 2025-05-07T20:32:25.3579221Z x0 = x[:, :D] 2025-05-07T20:32:25.3579318Z x1 = x[:, D:] 2025-05-07T20:32:25.3579395Z 2025-05-07T20:32:25.3579482Z if contiguous: 2025-05-07T20:32:25.3579582Z x0 = x0.contiguous() 2025-05-07T20:32:25.3579676Z x1 = x1.contiguous() 2025-05-07T20:32:25.3579752Z 2025-05-07T20:32:25.3579855Z if scale_ub is not None: 2025-05-07T20:32:25.3579966Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.3580144Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.3580231Z ) 2025-05-07T20:32:25.3580312Z else: 2025-05-07T20:32:25.3580416Z scale_ub_tensor = None 2025-05-07T20:32:25.3580494Z 2025-05-07T20:32:25.3580626Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3580725Z op = silu_mul_quant 2025-05-07T20:32:25.3580813Z if compiled: 2025-05-07T20:32:25.3580919Z op = torch.compile(op) 2025-05-07T20:32:25.3581037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3581115Z 2025-05-07T20:32:25.3581208Z y_fp8, y_scale = fn() 2025-05-07T20:32:25.3581336Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:25.3581411Z 2025-05-07T20:32:25.3581547Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3581657Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:25.3581805Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:25.3581934Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:25.3582079Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.3582156Z 2025-05-07T20:32:25.3582267Z > 
>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f05c4510900>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f05bf51bc40>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
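Both failure paths bottom out in the same Triton check: fp8e4nv (the FP8 E4M3 encoding, torch.float8_e4m3fn) is only accepted by Triton's NVIDIA backend on GPUs of compute capability 8.9 or newer; older parts are limited to fp8e4b15 and fp8e5. A minimal sketch of a capability guard that would skip such draws up front instead of erroring inside the kernel compile; the helper name and the placeholder test are illustrative, not taken from the test file:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's NVIDIA backend accepts fp8e4nv only on SM 8.9+ (Ada/Hopper).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @pytest.mark.skipif(
        not supports_fp8e4nv(),
        reason="GPU does not support fp8e4nv (float8_e4m3fn) in Triton",
    )
    def test_fp8_kernels_compile() -> None:
        # Placeholder body; in practice the marker would sit on test_silu_mul_quant.
        assert supports_fp8e4nv()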
Hypothesis continues drawing examples, and every one fails with the same root error; only the sampled parameters and the first kernel to reach the compiler differ. With compiled=False the failure is raised from fn() at moe/activation_test.py:117 while compiling _fbgemm_silu_mul_quant; with compiled=True, fn() completes and the failure is instead raised from ref_fn() at moe/activation_test.py:126 while compiling _kernel_quantize_fp8_row (via triton_quantize_fp8_row, fp8_gemm.py:2370). The repeated error is:

E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> ref_fn(): _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> fn(): _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fn(): _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> ref_fn(): _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fn(): _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> fn(): _fbgemm_silu_mul_quant
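The tracebacks above pin down everything needed to reproduce one failure without Hypothesis. A minimal sketch of a standalone reproducer for the first failing example; the import path is inferred from the traceback's file path and may differ, and the shapes and scale_ub match the T=16384 draw:

    import torch
    # Inferred from .../fbgemm_gpu/experimental/gen_ai/moe/activation.py in the traceback:
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 16384, 7168
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D], x[:, D:]
    scale_ub = torch.tensor([1200.0], device="cuda", dtype=torch.float32)

    # On a GPU without fp8e4nv support this raises
    # triton.compiler.errors.CompilationError wrapping the ValueError above.
    y_fp8, y_scale = silu_mul_quant(x0, x1, scale_ub)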
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.3695880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.3696112Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.3696451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.3696556Z kernel = self.compile( 2025-05-07T20:32:25.3696935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.3697153Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.3697291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3697296Z 2025-05-07T20:32:25.3697499Z self = 2025-05-07T20:32:25.3698277Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.3698819Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05be791f80>} 2025-05-07T20:32:25.3699565Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.3699762Z context = 2025-05-07T20:32:25.3699767Z 2025-05-07T20:32:25.3699931Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.3700200Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.3700308Z module_map=module_map) 2025-05-07T20:32:25.3700511Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.3700616Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.3700698Z E ^ 2025-05-07T20:32:25.3701051Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.3701061Z 2025-05-07T20:32:25.3701473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.3701478Z 2025-05-07T20:32:25.3701589Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.3701815Z self=, 2025-05-07T20:32:25.3701895Z T=1, 2025-05-07T20:32:25.3701977Z D=5120, 2025-05-07T20:32:25.3702067Z scale_ub=None, 2025-05-07T20:32:25.3702155Z contiguous=True, 2025-05-07T20:32:25.3702239Z compiled=True, 2025-05-07T20:32:25.3702322Z ) 2025-05-07T20:32:25.3702540Z self = 2025-05-07T20:32:25.3702713Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.3702717Z 2025-05-07T20:32:25.3702795Z @given( 2025-05-07T20:32:25.3702915Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3703020Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.3703137Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.3703253Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.3703421Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.3703500Z ) 2025-05-07T20:32:25.3703747Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.3703849Z def test_silu_mul_quant( 2025-05-07T20:32:25.3703929Z self, 2025-05-07T20:32:25.3704014Z T: int, 2025-05-07T20:32:25.3704093Z D: int, 2025-05-07T20:32:25.3704192Z scale_ub: Optional[float], 2025-05-07T20:32:25.3704299Z contiguous: bool, 2025-05-07T20:32:25.3704386Z compiled: bool, 2025-05-07T20:32:25.3704467Z ) -> None: 2025-05-07T20:32:25.3704570Z torch.manual_seed(2025) 2025-05-07T20:32:25.3704645Z 2025-05-07T20:32:25.3704814Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3704899Z 2025-05-07T20:32:25.3704994Z x_sign = torch.sign(x) 2025-05-07T20:32:25.3705122Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.3705224Z x = x_sign * x_clamp 2025-05-07T20:32:25.3705351Z x0 = x[:, :D] 2025-05-07T20:32:25.3705445Z x1 = x[:, D:] 2025-05-07T20:32:25.3705521Z 2025-05-07T20:32:25.3705808Z if contiguous: 2025-05-07T20:32:25.3705957Z x0 = x0.contiguous() 2025-05-07T20:32:25.3706087Z x1 = x1.contiguous() 2025-05-07T20:32:25.3706182Z 2025-05-07T20:32:25.3706283Z if scale_ub is not None: 2025-05-07T20:32:25.3706503Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.3706645Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.3706729Z ) 2025-05-07T20:32:25.3706808Z else: 2025-05-07T20:32:25.3706904Z scale_ub_tensor = None 2025-05-07T20:32:25.3706986Z 2025-05-07T20:32:25.3707117Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3707216Z op = silu_mul_quant 2025-05-07T20:32:25.3707304Z if compiled: 2025-05-07T20:32:25.3707411Z op = torch.compile(op) 2025-05-07T20:32:25.3707526Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3707602Z 2025-05-07T20:32:25.3707697Z y_fp8, y_scale = fn() 2025-05-07T20:32:25.3707825Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:25.3707902Z 2025-05-07T20:32:25.3708040Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3708149Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:25.3708326Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:25.3708449Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:25.3708597Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.3708673Z 2025-05-07T20:32:25.3708779Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:25.3708784Z 2025-05-07T20:32:25.3708885Z moe/activation_test.py:126: 2025-05-07T20:32:25.3709021Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3709139Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:25.3709272Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.3709835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:25.3709941Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:25.3710300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.3710536Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.3710899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:25.3711161Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.3711622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:25.3711877Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.3712257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:25.3712426Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:25.3712772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:25.3712857Z fn() 2025-05-07T20:32:25.3713255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:25.3713343Z self.fn.run( 2025-05-07T20:32:25.3713682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.3713778Z kernel = self.compile( 2025-05-07T20:32:25.3714225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.3714402Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.3714538Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3714543Z 2025-05-07T20:32:25.3714747Z self = 2025-05-07T20:32:25.3715558Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.3716064Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f05be792fc0>} 2025-05-07T20:32:25.3716812Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.3717010Z context = 2025-05-07T20:32:25.3717015Z 2025-05-07T20:32:25.3717183Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.3717449Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.3717599Z module_map=module_map) 2025-05-07T20:32:25.3717761Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.3717869Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:25.3717950Z E ^ 2025-05-07T20:32:25.3718302Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.3718307Z 2025-05-07T20:32:25.3718728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.3718733Z 2025-05-07T20:32:25.3718839Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.3719067Z self=, 2025-05-07T20:32:25.3719149Z T=2048, 2025-05-07T20:32:25.3719230Z D=5120, 2025-05-07T20:32:25.3719321Z scale_ub=None, 2025-05-07T20:32:25.3719407Z contiguous=True, 2025-05-07T20:32:25.3719497Z compiled=True, 2025-05-07T20:32:25.3719581Z ) 2025-05-07T20:32:25.3719798Z self = 2025-05-07T20:32:25.3719968Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.3719979Z 2025-05-07T20:32:25.3720062Z @given( 2025-05-07T20:32:25.3720184Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3720294Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.3720453Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.3720577Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.3720697Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.3720775Z ) 2025-05-07T20:32:25.3721020Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.3721123Z def test_silu_mul_quant( 2025-05-07T20:32:25.3721203Z self, 2025-05-07T20:32:25.3721283Z T: int, 2025-05-07T20:32:25.3721374Z D: int, 2025-05-07T20:32:25.3721474Z scale_ub: Optional[float], 2025-05-07T20:32:25.3721572Z contiguous: bool, 2025-05-07T20:32:25.3721659Z compiled: bool, 2025-05-07T20:32:25.3721738Z ) -> None: 2025-05-07T20:32:25.3721840Z torch.manual_seed(2025) 2025-05-07T20:32:25.3721921Z 2025-05-07T20:32:25.3722091Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3722176Z 2025-05-07T20:32:25.3722272Z x_sign = torch.sign(x) 2025-05-07T20:32:25.3722445Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.3722543Z x = x_sign * x_clamp 2025-05-07T20:32:25.3722628Z x0 = x[:, :D] 2025-05-07T20:32:25.3722712Z x1 = x[:, D:] 2025-05-07T20:32:25.3722795Z 2025-05-07T20:32:25.3722882Z if contiguous: 2025-05-07T20:32:25.3722982Z x0 = x0.contiguous() 2025-05-07T20:32:25.3723075Z x1 = x1.contiguous() 2025-05-07T20:32:25.3723218Z 2025-05-07T20:32:25.3723316Z if scale_ub is not None: 2025-05-07T20:32:25.3723428Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.3723565Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.3723650Z ) 2025-05-07T20:32:25.3723730Z else: 2025-05-07T20:32:25.3723828Z scale_ub_tensor = None 2025-05-07T20:32:25.3723915Z 2025-05-07T20:32:25.3724044Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3724137Z op = silu_mul_quant 2025-05-07T20:32:25.3724237Z if compiled: 
2025-05-07T20:32:25.3724339Z op = torch.compile(op) 2025-05-07T20:32:25.3724457Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3724534Z 2025-05-07T20:32:25.3724626Z y_fp8, y_scale = fn() 2025-05-07T20:32:25.3724753Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:25.3724830Z 2025-05-07T20:32:25.3724966Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3725118Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:25.3725224Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:25.3725347Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:25.3725494Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.3725572Z 2025-05-07T20:32:25.3725673Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:25.3725683Z 2025-05-07T20:32:25.3725786Z moe/activation_test.py:126: 2025-05-07T20:32:25.3725919Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3726037Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:25.3726172Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.3726728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:25.3726843Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:25.3727203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.3727432Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.3727880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:25.3728182Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.3728587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:25.3728839Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.3729217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:25.3729392Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:25.3729737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:25.3729824Z fn() 2025-05-07T20:32:25.3730222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:25.3730308Z self.fn.run( 2025-05-07T20:32:25.3730659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.3730794Z kernel = self.compile( 2025-05-07T20:32:25.3731175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.3731354Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.3731485Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3731490Z 2025-05-07T20:32:25.3731741Z self = 2025-05-07T20:32:25.3732514Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:25.3733020Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05be712ac0>} 2025-05-07T20:32:25.3733766Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.3733957Z context = 2025-05-07T20:32:25.3733962Z 2025-05-07T20:32:25.3734133Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.3734439Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.3734556Z module_map=module_map) 2025-05-07T20:32:25.3734719Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.3734822Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:25.3734907Z E ^ 2025-05-07T20:32:25.3735263Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.3735270Z 2025-05-07T20:32:25.3735684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.3735694Z 2025-05-07T20:32:25.3735802Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.3736024Z self=, 2025-05-07T20:32:25.3736113Z T=128, 2025-05-07T20:32:25.3736192Z D=5120, 2025-05-07T20:32:25.3736281Z scale_ub=None, 2025-05-07T20:32:25.3736374Z contiguous=True, 2025-05-07T20:32:25.3736459Z compiled=True, 2025-05-07T20:32:25.3736540Z ) 2025-05-07T20:32:25.3736767Z self = 2025-05-07T20:32:25.3736935Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.3736940Z 2025-05-07T20:32:25.3737034Z @given( 2025-05-07T20:32:25.3737155Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3737301Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.3737427Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.3737546Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.3737661Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.3737744Z ) 2025-05-07T20:32:25.3737990Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.3738091Z def test_silu_mul_quant( 2025-05-07T20:32:25.3738181Z self, 2025-05-07T20:32:25.3738261Z T: int, 2025-05-07T20:32:25.3738344Z D: int, 2025-05-07T20:32:25.3738455Z scale_ub: Optional[float], 2025-05-07T20:32:25.3738548Z contiguous: bool, 2025-05-07T20:32:25.3738640Z compiled: bool, 2025-05-07T20:32:25.3738721Z ) -> None: 2025-05-07T20:32:25.3738819Z torch.manual_seed(2025) 2025-05-07T20:32:25.3738907Z 2025-05-07T20:32:25.3739081Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3739199Z 2025-05-07T20:32:25.3739301Z x_sign = torch.sign(x) 2025-05-07T20:32:25.3739430Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.3739523Z x = x_sign * x_clamp 2025-05-07T20:32:25.3739613Z x0 = x[:, :D] 2025-05-07T20:32:25.3739697Z x1 = x[:, D:] 2025-05-07T20:32:25.3739770Z 2025-05-07T20:32:25.3739864Z if contiguous: 2025-05-07T20:32:25.3740001Z x0 = x0.contiguous() 2025-05-07T20:32:25.3740096Z x1 = x1.contiguous() 2025-05-07T20:32:25.3740171Z 2025-05-07T20:32:25.3740263Z if scale_ub is not None: 2025-05-07T20:32:25.3740373Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.3740510Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.3740598Z ) 2025-05-07T20:32:25.3740678Z else: 2025-05-07T20:32:25.3740775Z scale_ub_tensor = None 2025-05-07T20:32:25.3740860Z 2025-05-07T20:32:25.3740996Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:32:25.3741090Z op = silu_mul_quant 2025-05-07T20:32:25.3741190Z if compiled: 2025-05-07T20:32:25.3741296Z op = torch.compile(op) 2025-05-07T20:32:25.3741403Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3741491Z 2025-05-07T20:32:25.3741587Z y_fp8, y_scale = fn() 2025-05-07T20:32:25.3741711Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:25.3741841Z 2025-05-07T20:32:25.3741980Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3742092Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:25.3742198Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:25.3757990Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:25.3758174Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.3758249Z 2025-05-07T20:32:25.3758363Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:25.3758369Z 2025-05-07T20:32:25.3758471Z moe/activation_test.py:126: 2025-05-07T20:32:25.3758604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3758715Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:25.3758853Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.3759424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:25.3759537Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:25.3759901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.3760127Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.3760567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:25.3760828Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.3761229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:25.3761483Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.3761866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:25.3762040Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:25.3762385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:25.3762474Z fn() 2025-05-07T20:32:25.3762926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:25.3763013Z self.fn.run( 2025-05-07T20:32:25.3763408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.3763512Z kernel = self.compile( 2025-05-07T20:32:25.3763902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.3764080Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.3764214Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3764259Z 2025-05-07T20:32:25.3764477Z self = 2025-05-07T20:32:25.3765258Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
2025-05-07T20:32:25.3768584Z 
2025-05-07T20:32:25.3768691Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:25.3775621Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:25.3775728Z moe/activation_test.py:126: 
2025-05-07T20:32:25.3784610Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.3784715Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:25.3784803Z E       ^
2025-05-07T20:32:25.3785157Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.3785579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.3785695Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:25.3792562Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:25.3792697Z moe/activation_test.py:126: 
2025-05-07T20:32:25.3801547Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.3801650Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:25.3801732Z E       ^
2025-05-07T20:32:25.3802092Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.3802510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.3802657Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:25.3808860Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.3808961Z moe/activation_test.py:117: 
2025-05-07T20:32:25.3809100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:25.3809204Z moe/activation_test.py:115: in fn
2025-05-07T20:32:25.3809309Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:25.3809680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:25.3809773Z     return fn(*args, **kwargs)
2025-05-07T20:32:25.3810265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:25.3810375Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:25.3810732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:25.3810958Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:25.3811298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:25.3811396Z     kernel = self.compile(
2025-05-07T20:32:25.3811853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:25.3812030Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:25.3812163Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:25.3812167Z 
2025-05-07T20:32:25.3812371Z self = <...>
2025-05-07T20:32:25.3813145Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:25.3813657Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f0509a30a40>}
2025-05-07T20:32:25.3814458Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:25.3814652Z context = <...>
2025-05-07T20:32:25.3814657Z 
2025-05-07T20:32:25.3814820Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:25.3815082Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:25.3815236Z                            module_map=module_map)
2025-05-07T20:32:25.3815400Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.3815506Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.3815586Z E       ^
2025-05-07T20:32:25.3815939Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.3816366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.3816477Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:25.3823339Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:25.3823449Z moe/activation_test.py:126: 
2025-05-07T20:32:25.3832317Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.3832421Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:25.3832508Z E       ^
2025-05-07T20:32:25.3832861Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.3833285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
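The reference path that fails above, ref_fn, reduces to SiLU-times-gate in fp32 followed by rowwise fp8 quantization. A minimal PyTorch sketch of that rowwise scheme, as a stand-in for triton_quantize_fp8_row rather than FBGEMM's actual kernel; the 448.0 maximum for float8_e4m3fn and the treatment of scale_ub as a cap on the per-row max are assumptions consistent with the test's dequantization check (y_fp8.to(torch.float32) * y_scale[:, None]):

import torch

def quantize_fp8_row_sketch(y: torch.Tensor, scale_ub: torch.Tensor | None = None,
                            eps: float = 1e-12):
    # One dequantization scale per row, chosen so that
    # y ≈ y_fp8.to(torch.float32) * scale[:, None], as the test asserts.
    fp8_max = 448.0  # largest finite float8_e4m3fn value
    row_max = y.abs().amax(dim=1).clamp(min=eps)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max / fp8_max
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale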
2025-05-07T20:32:25.3833400Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:25.3839382Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.3839487Z moe/activation_test.py:117: 
2025-05-07T20:32:25.3845484Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.3845592Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.3845672Z E       ^
2025-05-07T20:32:25.3846027Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.3846449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.3846565Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:25.3852292Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.3852400Z moe/activation_test.py:117: 
2025-05-07T20:32:25.3858816Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.3858914Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.3859000Z E       ^
2025-05-07T20:32:25.3859352Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.3859771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.3859889Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:25.3865526Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.3865628Z moe/activation_test.py:117: 
2025-05-07T20:32:25.3871489Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.3871598Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.3871717Z E       ^
2025-05-07T20:32:25.3872074Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.3872489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
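Every remaining example raises the identical CompilationError, so Hypothesis keeps exercising a configuration this hardware cannot compile. One possible mitigation, sketched as a hypothetical guard that does not exist in moe/activation_test.py, would be to skip fp8 tests on GPUs without fp8e4nv support:

import unittest

import torch

def has_fp8e4nv() -> bool:
    # Hypothetical helper: Triton lowers fp8e4nv only on sm_89 and newer GPUs.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Applied to the failing test, e.g.:
# @unittest.skipUnless(has_fp8e4nv(), "fp8e4nv requires an sm_89+ GPU")
# def test_silu_mul_quant(self, ...) -> None: ...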
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.3866965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.3867196Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.3867536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.3867635Z kernel = self.compile( 2025-05-07T20:32:25.3868015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.3868187Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.3868325Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3868329Z 2025-05-07T20:32:25.3868530Z self = 2025-05-07T20:32:25.3869305Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.3869845Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05091fca40>} 2025-05-07T20:32:25.3870593Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.3870784Z context = 2025-05-07T20:32:25.3870792Z 2025-05-07T20:32:25.3870956Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.3871222Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.3871329Z module_map=module_map) 2025-05-07T20:32:25.3871489Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.3871598Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.3871717Z E ^ 2025-05-07T20:32:25.3872074Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.3872079Z 2025-05-07T20:32:25.3872489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.3872494Z 2025-05-07T20:32:25.3872597Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.3872862Z self=, 2025-05-07T20:32:25.3872941Z T=128, 2025-05-07T20:32:25.3873021Z D=5120, 2025-05-07T20:32:25.3873110Z scale_ub=None, 2025-05-07T20:32:25.3873197Z contiguous=False, 2025-05-07T20:32:25.3873285Z compiled=False, 2025-05-07T20:32:25.3873361Z ) 2025-05-07T20:32:25.3873576Z self = 2025-05-07T20:32:25.3873754Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:25.3873761Z 2025-05-07T20:32:25.3873842Z @given( 2025-05-07T20:32:25.3873963Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3874072Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.3874188Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.3874303Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.3874422Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.3874542Z ) 2025-05-07T20:32:25.3874788Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.3874883Z def test_silu_mul_quant( 2025-05-07T20:32:25.3874963Z self, 2025-05-07T20:32:25.3875046Z T: int, 2025-05-07T20:32:25.3875123Z D: int, 2025-05-07T20:32:25.3875222Z scale_ub: Optional[float], 2025-05-07T20:32:25.3875317Z contiguous: bool, 2025-05-07T20:32:25.3875405Z compiled: bool, 2025-05-07T20:32:25.3875482Z ) -> None: 2025-05-07T20:32:25.3875584Z torch.manual_seed(2025) 2025-05-07T20:32:25.3875659Z 2025-05-07T20:32:25.3875827Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3875910Z 2025-05-07T20:32:25.3875999Z x_sign = torch.sign(x) 2025-05-07T20:32:25.3876127Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.3876216Z x = x_sign * x_clamp 2025-05-07T20:32:25.3876302Z x0 = x[:, :D] 2025-05-07T20:32:25.3876391Z x1 = x[:, D:] 2025-05-07T20:32:25.3876466Z 2025-05-07T20:32:25.3876549Z if contiguous: 2025-05-07T20:32:25.3876644Z x0 = x0.contiguous() 2025-05-07T20:32:25.3876735Z x1 = x1.contiguous() 2025-05-07T20:32:25.3876807Z 2025-05-07T20:32:25.3876903Z if scale_ub is not None: 2025-05-07T20:32:25.3877010Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.3877145Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.3877271Z ) 2025-05-07T20:32:25.3877351Z else: 2025-05-07T20:32:25.3877451Z scale_ub_tensor = None 2025-05-07T20:32:25.3877526Z 2025-05-07T20:32:25.3877653Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3877747Z op = silu_mul_quant 2025-05-07T20:32:25.3877838Z if compiled: 2025-05-07T20:32:25.3877942Z op = torch.compile(op) 2025-05-07T20:32:25.3878057Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3878130Z 2025-05-07T20:32:25.3878222Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.3878226Z 2025-05-07T20:32:25.3878328Z moe/activation_test.py:117: 2025-05-07T20:32:25.3878459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3878563Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.3878661Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3879200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.3879306Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.3879663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.3879882Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.3880228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.3880363Z kernel = self.compile( 2025-05-07T20:32:25.3880746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.3880919Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.3881047Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3881051Z 2025-05-07T20:32:25.3881265Z self = 2025-05-07T20:32:25.3882034Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.3882559Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0508c84720>} 2025-05-07T20:32:25.3883394Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.3883581Z context = 2025-05-07T20:32:25.3883586Z 2025-05-07T20:32:25.3883754Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.3884020Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.3884134Z module_map=module_map) 2025-05-07T20:32:25.3884298Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.3884398Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.3884484Z E ^ 2025-05-07T20:32:25.3884836Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.3884846Z 2025-05-07T20:32:25.3885261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.3885265Z 2025-05-07T20:32:25.3885370Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.3885591Z self=, 2025-05-07T20:32:25.3885680Z T=128, 2025-05-07T20:32:25.3890263Z D=5120, 2025-05-07T20:32:25.3890445Z scale_ub=1200.0, 2025-05-07T20:32:25.3890540Z contiguous=True, 2025-05-07T20:32:25.3890625Z compiled=False, 2025-05-07T20:32:25.3890704Z ) 2025-05-07T20:32:25.3890930Z self = 2025-05-07T20:32:25.3891104Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.3891109Z 2025-05-07T20:32:25.3891191Z @given( 2025-05-07T20:32:25.3891313Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3891417Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.3891537Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.3891654Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.3891777Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.3891853Z ) 2025-05-07T20:32:25.3892100Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.3892205Z def test_silu_mul_quant( 2025-05-07T20:32:25.3892327Z self, 2025-05-07T20:32:25.3892408Z T: int, 2025-05-07T20:32:25.3892494Z D: int, 2025-05-07T20:32:25.3892596Z scale_ub: Optional[float], 2025-05-07T20:32:25.3892688Z contiguous: bool, 2025-05-07T20:32:25.3892786Z compiled: bool, 2025-05-07T20:32:25.3892874Z ) -> None: 2025-05-07T20:32:25.3892974Z torch.manual_seed(2025) 2025-05-07T20:32:25.3893053Z 2025-05-07T20:32:25.3893302Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3893382Z 2025-05-07T20:32:25.3893481Z x_sign = torch.sign(x) 2025-05-07T20:32:25.3893608Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.3893711Z x = x_sign * x_clamp 2025-05-07T20:32:25.3893794Z x0 = x[:, :D] 2025-05-07T20:32:25.3893878Z x1 = x[:, D:] 2025-05-07T20:32:25.3893958Z 2025-05-07T20:32:25.3894048Z if contiguous: 2025-05-07T20:32:25.3894145Z x0 = x0.contiguous() 2025-05-07T20:32:25.3894245Z x1 = x1.contiguous() 2025-05-07T20:32:25.3894320Z 2025-05-07T20:32:25.3894413Z if scale_ub is not None: 2025-05-07T20:32:25.3894525Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.3894662Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.3894740Z ) 2025-05-07T20:32:25.3894821Z else: 2025-05-07T20:32:25.3894920Z scale_ub_tensor = None 2025-05-07T20:32:25.3895053Z 2025-05-07T20:32:25.3895185Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3895278Z op = silu_mul_quant 2025-05-07T20:32:25.3895367Z if compiled: 2025-05-07T20:32:25.3895473Z op = torch.compile(op) 2025-05-07T20:32:25.3895582Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3895666Z 2025-05-07T20:32:25.3895764Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.3895769Z 2025-05-07T20:32:25.3895872Z moe/activation_test.py:117: 2025-05-07T20:32:25.3896015Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3896118Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.3896224Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3896730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.3896834Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:25.3897204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:25.3897427Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:25.3897769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:25.3897870Z     kernel = self.compile(
2025-05-07T20:32:25.3898301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:25.3898478Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:25.3898607Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:25.3898612Z 
2025-05-07T20:32:25.3898820Z self = 
2025-05-07T20:32:25.3899596Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:25.3900108Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0508c858a0>}
2025-05-07T20:32:25.3900896Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:25.3901090Z context = 
2025-05-07T20:32:25.3901095Z 
2025-05-07T20:32:25.3901264Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:25.3901529Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:25.3901683Z                            module_map=module_map)
2025-05-07T20:32:25.3901848Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.3901950Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.3902037Z E       ^
2025-05-07T20:32:25.3902395Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.3902401Z 
2025-05-07T20:32:25.3902863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.3902873Z 
2025-05-07T20:32:25.3902980Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.3903202Z     self=,
2025-05-07T20:32:25.3903287Z     T=1,
2025-05-07T20:32:25.3903368Z     D=7168,
2025-05-07T20:32:25.3903456Z     scale_ub=1200.0,
2025-05-07T20:32:25.3903547Z     contiguous=True,
2025-05-07T20:32:25.3903634Z     compiled=True,
2025-05-07T20:32:25.3903755Z )
2025-05-07T20:32:25.3903978Z self = 
2025-05-07T20:32:25.3904144Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:25.3904149Z 
2025-05-07T20:32:25.3904234Z     @given(
2025-05-07T20:32:25.3904355Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:25.3904458Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:25.3904579Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:25.3904704Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:25.3904820Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:25.3904898Z     )
2025-05-07T20:32:25.3905146Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:25.3905247Z     def test_silu_mul_quant(
2025-05-07T20:32:25.3905331Z         self,
2025-05-07T20:32:25.3905411Z         T: int,
2025-05-07T20:32:25.3905493Z         D: int,
2025-05-07T20:32:25.3905815Z         scale_ub: Optional[float],
2025-05-07T20:32:25.3905954Z         contiguous: bool,
2025-05-07T20:32:25.3906088Z         compiled: bool,
2025-05-07T20:32:25.3906185Z     ) -> None:
2025-05-07T20:32:25.3906284Z         torch.manual_seed(2025)
2025-05-07T20:32:25.3906366Z 
2025-05-07T20:32:25.3906539Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:25.3906614Z 
2025-05-07T20:32:25.3906710Z         x_sign = torch.sign(x)
2025-05-07T20:32:25.3906931Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:25.3907023Z         x = x_sign * x_clamp
2025-05-07T20:32:25.3907112Z         x0 = x[:, :D]
2025-05-07T20:32:25.3907197Z         x1 = x[:, D:]
2025-05-07T20:32:25.3907275Z 
2025-05-07T20:32:25.3907366Z         if contiguous:
2025-05-07T20:32:25.3907460Z             x0 = x0.contiguous()
2025-05-07T20:32:25.3907556Z             x1 = x1.contiguous()
2025-05-07T20:32:25.3907635Z 
2025-05-07T20:32:25.3907733Z         if scale_ub is not None:
2025-05-07T20:32:25.3907852Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:25.3907992Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:25.3908069Z             )
2025-05-07T20:32:25.3908149Z         else:
2025-05-07T20:32:25.3908247Z             scale_ub_tensor = None
2025-05-07T20:32:25.3908323Z 
2025-05-07T20:32:25.3908467Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:25.3908559Z             op = silu_mul_quant
2025-05-07T20:32:25.3908713Z             if compiled:
2025-05-07T20:32:25.3908822Z                 op = torch.compile(op)
2025-05-07T20:32:25.3908930Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:25.3909011Z 
2025-05-07T20:32:25.3909103Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.3909108Z 
2025-05-07T20:32:25.3909206Z moe/activation_test.py:117: 
2025-05-07T20:32:25.3909340Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:25.3909505Z moe/activation_test.py:115: in fn
2025-05-07T20:32:25.3909607Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:25.3909978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:25.3910072Z     return fn(*args, **kwargs)
2025-05-07T20:32:25.3910567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:25.3910675Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:25.3911038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:25.3911262Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:25.3911601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:25.3911699Z     kernel = self.compile(
2025-05-07T20:32:25.3912149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:25.3912326Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:25.3912458Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:25.3912463Z 
2025-05-07T20:32:25.3912666Z self = 
2025-05-07T20:32:25.3913444Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:25.3913948Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0508c86e80>}
2025-05-07T20:32:25.3914695Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:25.3914894Z context = 
2025-05-07T20:32:25.3914898Z 
2025-05-07T20:32:25.3915062Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:25.3915327Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:25.3915486Z                            module_map=module_map)
2025-05-07T20:32:25.3915652Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.3915760Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.3915840Z E       ^
2025-05-07T20:32:25.3915840Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.3916201Z 
2025-05-07T20:32:25.3916616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.3916625Z 
2025-05-07T20:32:25.3916730Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[... test source listing and fp8e4nv CompilationError traceback identical to the example above ...]
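Every failure above has the same root cause: fp8e4nv is Triton's FP8 E4M3 encoding, and this Triton build only lowers it on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper), while the g5 runner's A10G is SM 8.6, so the ValueError fires before any kernel runs. Below is a minimal sketch of a capability guard that could skip these tests on such hardware; the helper and the skipIf wiring are illustrative, not code from this repository.

```python
# Sketch only: gate FP8 E4M3 (fp8e4nv) tests on device capability.
# The 8.9 cutoff matches the Triton error above; class/function names are illustrative.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv lowering requires SM 8.9+ (e.g. L4, L40, H100); the A10G is SM 8.6.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv needs compute capability >= 8.9")
class ActivationTests(unittest.TestCase):
    ...
```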
2025-05-07T20:32:25.3930407Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
[... identical test source elided; on this example fn() returns and the failure moves to the reference path ...]
2025-05-07T20:32:25.3936098Z         y_fp8, y_scale = fn()
2025-05-07T20:32:25.3936220Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:25.3936294Z 
2025-05-07T20:32:25.3936489Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:25.3936593Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:25.3936698Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:25.3936823Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:25.3936962Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:25.3937046Z 
2025-05-07T20:32:25.3937150Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:25.3937155Z 
2025-05-07T20:32:25.3937258Z moe/activation_test.py:126: 
2025-05-07T20:32:25.3937405Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:25.3937512Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:25.3937648Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:25.3938208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:25.3938315Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:25.3938679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:25.3938900Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:25.3939263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:25.3939567Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:25.3939964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:25.3940218Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:25.3940593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:25.3940763Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:25.3941107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:25.3941186Z     fn()
2025-05-07T20:32:25.3941585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:25.3941674Z     self.fn.run(
2025-05-07T20:32:25.3942055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:25.3942157Z     kernel = self.compile(
2025-05-07T20:32:25.3942539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:25.3942714Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:25.3942846Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:25.3942889Z 
2025-05-07T20:32:25.3943099Z self = 
2025-05-07T20:32:25.3943879Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:25.3944379Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0508e45580>}
2025-05-07T20:32:25.3945122Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:25.3945314Z context = 
2025-05-07T20:32:25.3945319Z 
2025-05-07T20:32:25.3945482Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:25.3945789Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:25.3945897Z                            module_map=module_map)
2025-05-07T20:32:25.3946060Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.3946172Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:25.3946250Z E       ^
2025-05-07T20:32:25.3946609Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.3946616Z 
2025-05-07T20:32:25.3947029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.3947034Z 
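The reference path above spells out the semantics under test: y = x0 * sigmoid(x0) * x1 (SiLU-and-multiply), followed by row-wise FP8 quantization with an optional scale upper bound. The sketch below shows that quantization in plain PyTorch, consistent with the test's dequant step y_fp8.to(torch.float32) * y_scale[:, None]; the helper name, the epsilon guard, and the choice of torch.float8_e4m3fn are assumptions here, and FBGEMM's triton_quantize_fp8_row remains the authoritative implementation.

```python
# Sketch of row-wise FP8 quantization in plain PyTorch (runs on any device).
# Not FBGEMM's implementation; the names and the eps guard are assumptions.
from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row absolute maximum, optionally clipped to the scale upper bound.
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    # Dequantization scale per row; eps guard avoids division by zero.
    scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
    # Scale into the representable FP8 range, clamp, then cast.
    y_scaled = torch.clamp(y.to(torch.float32) / scale[:, None], -FP8_MAX, FP8_MAX)
    return y_scaled.to(torch.float8_e4m3fn), scale


# Dequantization mirrors the test: y ~= y_fp8.to(torch.float32) * scale[:, None]
```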
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.3959953Z 2025-05-07T20:32:25.3960365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.3960370Z 2025-05-07T20:32:25.3960479Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.3960701Z self=, 2025-05-07T20:32:25.3960779Z T=1, 2025-05-07T20:32:25.3960866Z D=5120, 2025-05-07T20:32:25.3960955Z scale_ub=1200.0, 2025-05-07T20:32:25.3961044Z contiguous=False, 2025-05-07T20:32:25.3961135Z compiled=False, 2025-05-07T20:32:25.3961213Z ) 2025-05-07T20:32:25.3961438Z self = 2025-05-07T20:32:25.3961606Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:25.3961610Z 2025-05-07T20:32:25.3961689Z @given( 2025-05-07T20:32:25.3961858Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3961958Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.3962076Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.3962196Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.3962310Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.3962386Z ) 2025-05-07T20:32:25.3962658Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.3962767Z def test_silu_mul_quant( 2025-05-07T20:32:25.3962864Z self, 2025-05-07T20:32:25.3962941Z T: int, 2025-05-07T20:32:25.3963018Z D: int, 2025-05-07T20:32:25.3963122Z scale_ub: Optional[float], 2025-05-07T20:32:25.3963216Z contiguous: bool, 2025-05-07T20:32:25.3963304Z compiled: bool, 2025-05-07T20:32:25.3963390Z ) -> None: 2025-05-07T20:32:25.3963485Z torch.manual_seed(2025) 2025-05-07T20:32:25.3963566Z 2025-05-07T20:32:25.3963743Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3963819Z 2025-05-07T20:32:25.3963911Z x_sign = torch.sign(x) 2025-05-07T20:32:25.3964040Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.3964129Z x = x_sign * x_clamp 2025-05-07T20:32:25.3964214Z x0 = x[:, :D] 2025-05-07T20:32:25.3964296Z x1 = x[:, D:] 2025-05-07T20:32:25.3964371Z 2025-05-07T20:32:25.3964460Z if contiguous: 2025-05-07T20:32:25.3964602Z x0 = x0.contiguous() 2025-05-07T20:32:25.3964694Z x1 = x1.contiguous() 2025-05-07T20:32:25.3964773Z 2025-05-07T20:32:25.3964866Z if scale_ub is not None: 2025-05-07T20:32:25.3964974Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.3965114Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.3965190Z ) 2025-05-07T20:32:25.3965267Z else: 2025-05-07T20:32:25.3965372Z scale_ub_tensor = None 2025-05-07T20:32:25.3965448Z 2025-05-07T20:32:25.3965579Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3965675Z op = silu_mul_quant 2025-05-07T20:32:25.3965762Z if compiled: 2025-05-07T20:32:25.3965867Z op = torch.compile(op) 2025-05-07T20:32:25.3965975Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3966051Z 2025-05-07T20:32:25.3966149Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.3966157Z 2025-05-07T20:32:25.3966295Z moe/activation_test.py:117: 2025-05-07T20:32:25.3966428Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3966533Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.3966633Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3967133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.3967276Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.3967750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.3967978Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.3968317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.3968413Z kernel = self.compile( 2025-05-07T20:32:25.3968801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.3968975Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.3969105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3969110Z 2025-05-07T20:32:25.3969311Z self = 2025-05-07T20:32:25.3970088Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.3970638Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0508e472e0>} 2025-05-07T20:32:25.3971385Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.3971578Z context = 2025-05-07T20:32:25.3971583Z 2025-05-07T20:32:25.3971746Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.3972011Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.3972131Z module_map=module_map) 2025-05-07T20:32:25.3972293Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.3972396Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.3972477Z E ^ 2025-05-07T20:32:25.3972831Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.3972835Z 2025-05-07T20:32:25.3973294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.3973299Z 2025-05-07T20:32:25.3973405Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.3973628Z self=, 2025-05-07T20:32:25.3973708Z T=16384, 2025-05-07T20:32:25.3973786Z D=5120, 2025-05-07T20:32:25.3973875Z scale_ub=1200.0, 2025-05-07T20:32:25.3973963Z contiguous=False, 2025-05-07T20:32:25.3974054Z compiled=True, 2025-05-07T20:32:25.3974132Z ) 2025-05-07T20:32:25.3974350Z self = 2025-05-07T20:32:25.3974527Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:25.3974531Z 2025-05-07T20:32:25.3974612Z @given( 2025-05-07T20:32:25.3974733Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3974837Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.3974955Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.3975113Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.3975233Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.3975309Z ) 2025-05-07T20:32:25.3975555Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.3975652Z def test_silu_mul_quant( 2025-05-07T20:32:25.3975729Z self, 2025-05-07T20:32:25.3975847Z T: int, 2025-05-07T20:32:25.3975932Z D: int, 2025-05-07T20:32:25.3976031Z scale_ub: Optional[float], 2025-05-07T20:32:25.3976121Z contiguous: bool, 2025-05-07T20:32:25.3976211Z compiled: bool, 2025-05-07T20:32:25.3976290Z ) -> None: 2025-05-07T20:32:25.3976388Z torch.manual_seed(2025) 2025-05-07T20:32:25.3976462Z 2025-05-07T20:32:25.3976633Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3976712Z 2025-05-07T20:32:25.3976809Z x_sign = torch.sign(x) 2025-05-07T20:32:25.3976939Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.3977031Z x = x_sign * x_clamp 2025-05-07T20:32:25.3977113Z x0 = x[:, :D] 2025-05-07T20:32:25.3977194Z x1 = x[:, D:] 2025-05-07T20:32:25.3977273Z 2025-05-07T20:32:25.3977358Z if contiguous: 2025-05-07T20:32:25.3977450Z x0 = x0.contiguous() 2025-05-07T20:32:25.3977547Z x1 = x1.contiguous() 2025-05-07T20:32:25.3977672Z 2025-05-07T20:32:25.3977769Z if scale_ub is not None: 2025-05-07T20:32:25.3977877Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.3978013Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.3978095Z ) 2025-05-07T20:32:25.3978173Z else: 2025-05-07T20:32:25.3978268Z scale_ub_tensor = None 2025-05-07T20:32:25.3978345Z 2025-05-07T20:32:25.3978474Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3978569Z op = silu_mul_quant 2025-05-07T20:32:25.3978662Z if compiled: 2025-05-07T20:32:25.3978763Z op = torch.compile(op) 2025-05-07T20:32:25.3978870Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3978946Z 2025-05-07T20:32:25.3979037Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.3979042Z 2025-05-07T20:32:25.3979141Z moe/activation_test.py:117: 2025-05-07T20:32:25.3979271Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3979378Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.3979481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3979848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.3979943Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.3980508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.3980611Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.3980974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.3981197Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.3981535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.3981639Z kernel = self.compile( 2025-05-07T20:32:25.3982020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.3982199Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.3982326Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3982331Z 2025-05-07T20:32:25.3982538Z self = 2025-05-07T20:32:25.3983356Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.3983857Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0509844fe0>} 2025-05-07T20:32:25.3984644Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.3984833Z context = 2025-05-07T20:32:25.3984838Z 2025-05-07T20:32:25.3985004Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.3985268Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.3985378Z module_map=module_map) 2025-05-07T20:32:25.3985544Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.3985644Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.3985727Z E ^ 2025-05-07T20:32:25.3986083Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.3986090Z 2025-05-07T20:32:25.3986545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.3986550Z 2025-05-07T20:32:25.3986657Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.3986883Z self=, 2025-05-07T20:32:25.3986961Z T=2048, 2025-05-07T20:32:25.3987045Z D=7168, 2025-05-07T20:32:25.3987132Z scale_ub=1200.0, 2025-05-07T20:32:25.3987224Z contiguous=False, 2025-05-07T20:32:25.3987310Z compiled=True, 2025-05-07T20:32:25.3987388Z ) 2025-05-07T20:32:25.3987609Z self = 2025-05-07T20:32:25.3987787Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:25.3987792Z 2025-05-07T20:32:25.3987871Z @given( 2025-05-07T20:32:25.3987995Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3988097Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.3988221Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.3988341Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.3988455Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.3988533Z ) 2025-05-07T20:32:25.3988777Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.3988872Z def test_silu_mul_quant( 2025-05-07T20:32:25.3988954Z self, 2025-05-07T20:32:25.3989075Z T: int, 2025-05-07T20:32:25.3989159Z D: int, 2025-05-07T20:32:25.3989262Z scale_ub: Optional[float], 2025-05-07T20:32:25.3989352Z contiguous: bool, 2025-05-07T20:32:25.3989440Z compiled: bool, 2025-05-07T20:32:25.3989523Z ) -> None: 2025-05-07T20:32:25.3989619Z torch.manual_seed(2025) 2025-05-07T20:32:25.3989696Z 2025-05-07T20:32:25.3989870Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3989951Z 2025-05-07T20:32:25.3990047Z x_sign = torch.sign(x) 2025-05-07T20:32:25.3990173Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.3990262Z x = x_sign * x_clamp 2025-05-07T20:32:25.3990348Z x0 = x[:, :D] 2025-05-07T20:32:25.3990429Z x1 = x[:, D:] 2025-05-07T20:32:25.3990504Z 2025-05-07T20:32:25.3990594Z if contiguous: 2025-05-07T20:32:25.3990686Z x0 = x0.contiguous() 2025-05-07T20:32:25.3990776Z x1 = x1.contiguous() 2025-05-07T20:32:25.3990860Z 2025-05-07T20:32:25.3990998Z if scale_ub is not None: 2025-05-07T20:32:25.3991107Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.3991247Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.3991324Z ) 2025-05-07T20:32:25.3991408Z else: 2025-05-07T20:32:25.3991504Z scale_ub_tensor = None 2025-05-07T20:32:25.3991578Z 2025-05-07T20:32:25.3991712Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3991845Z op = silu_mul_quant 2025-05-07T20:32:25.3991933Z if compiled: 2025-05-07T20:32:25.3992036Z op = torch.compile(op) 2025-05-07T20:32:25.3992144Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3992219Z 2025-05-07T20:32:25.3992313Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.3992318Z 2025-05-07T20:32:25.3992428Z moe/activation_test.py:117: 2025-05-07T20:32:25.3992581Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3992706Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.3992806Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3993178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.3993274Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.3993766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.3993917Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.3994274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.3994501Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.3994839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.3994936Z kernel = self.compile( 2025-05-07T20:32:25.3995322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.3995496Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.3995623Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3995627Z 2025-05-07T20:32:25.3995835Z self = 2025-05-07T20:32:25.3996612Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.3997113Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0509845b20>} 2025-05-07T20:32:25.3997900Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.3998096Z context = 2025-05-07T20:32:25.3998100Z 2025-05-07T20:32:25.3998265Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.3998530Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.3998644Z module_map=module_map) 2025-05-07T20:32:25.3998806Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.3998907Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.3998990Z E ^ 2025-05-07T20:32:25.3999343Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.3999348Z 2025-05-07T20:32:25.3999810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.3999815Z 2025-05-07T20:32:25.3999921Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4000142Z self=, 2025-05-07T20:32:25.4000229Z T=1, 2025-05-07T20:32:25.4000307Z D=5120, 2025-05-07T20:32:25.4000390Z scale_ub=None, 2025-05-07T20:32:25.4000524Z contiguous=False, 2025-05-07T20:32:25.4000610Z compiled=False, 2025-05-07T20:32:25.4000690Z ) 2025-05-07T20:32:25.4000908Z self = 2025-05-07T20:32:25.4001073Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:25.4001078Z 2025-05-07T20:32:25.4001159Z @given( 2025-05-07T20:32:25.4001280Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4001381Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4001503Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4001621Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4001736Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4001816Z ) 2025-05-07T20:32:25.4002059Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4002160Z def test_silu_mul_quant( 2025-05-07T20:32:25.4002283Z self, 2025-05-07T20:32:25.4002362Z T: int, 2025-05-07T20:32:25.4002443Z D: int, 2025-05-07T20:32:25.4002542Z scale_ub: Optional[float], 2025-05-07T20:32:25.4002633Z contiguous: bool, 2025-05-07T20:32:25.4002723Z compiled: bool, 2025-05-07T20:32:25.4002807Z ) -> None: 2025-05-07T20:32:25.4002923Z torch.manual_seed(2025) 2025-05-07T20:32:25.4003007Z 2025-05-07T20:32:25.4003194Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4003272Z 2025-05-07T20:32:25.4003371Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4003495Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4003589Z x = x_sign * x_clamp 2025-05-07T20:32:25.4003672Z x0 = x[:, :D] 2025-05-07T20:32:25.4003754Z x1 = x[:, D:] 2025-05-07T20:32:25.4003830Z 2025-05-07T20:32:25.4003915Z if contiguous: 2025-05-07T20:32:25.4004009Z x0 = x0.contiguous() 2025-05-07T20:32:25.4004108Z x1 = x1.contiguous() 2025-05-07T20:32:25.4004182Z 2025-05-07T20:32:25.4004274Z if scale_ub is not None: 2025-05-07T20:32:25.4004384Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.4004520Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.4004596Z ) 2025-05-07T20:32:25.4004675Z else: 2025-05-07T20:32:25.4004768Z scale_ub_tensor = None 2025-05-07T20:32:25.4004845Z 2025-05-07T20:32:25.4005018Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.4005111Z op = silu_mul_quant 2025-05-07T20:32:25.4005197Z if compiled: 2025-05-07T20:32:25.4005296Z op = torch.compile(op) 2025-05-07T20:32:25.4005401Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4005476Z 2025-05-07T20:32:25.4005566Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.4005571Z 2025-05-07T20:32:25.4005871Z moe/activation_test.py:117: 2025-05-07T20:32:25.4006063Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4006164Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.4006268Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4006762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.4006861Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.4007303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.4007582Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.4007922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.4008022Z kernel = self.compile( 2025-05-07T20:32:25.4008403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.4008645Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.4008773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4008778Z 2025-05-07T20:32:25.4008981Z self = 2025-05-07T20:32:25.4009759Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.4010258Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0509846e80>} 2025-05-07T20:32:25.4011004Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.4011288Z context = 2025-05-07T20:32:25.4011292Z 2025-05-07T20:32:25.4011457Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.4011722Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.4011831Z module_map=module_map) 2025-05-07T20:32:25.4012002Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.4012102Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.4012180Z E ^ 2025-05-07T20:32:25.4012571Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.4012577Z 2025-05-07T20:32:25.4016458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.4016475Z 2025-05-07T20:32:25.4016594Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4016822Z self=, 2025-05-07T20:32:25.4016900Z T=4096, 2025-05-07T20:32:25.4016981Z D=7168, 2025-05-07T20:32:25.4017066Z scale_ub=1200.0, 2025-05-07T20:32:25.4017153Z contiguous=False, 2025-05-07T20:32:25.4017242Z compiled=False, 2025-05-07T20:32:25.4017316Z ) 2025-05-07T20:32:25.4017629Z self = 2025-05-07T20:32:25.4017812Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:25.4017817Z 2025-05-07T20:32:25.4017895Z @given( 2025-05-07T20:32:25.4018028Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4018128Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4018244Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4018370Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4018484Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4018561Z ) 2025-05-07T20:32:25.4018809Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4018906Z def test_silu_mul_quant( 2025-05-07T20:32:25.4018988Z self, 2025-05-07T20:32:25.4019070Z T: int, 2025-05-07T20:32:25.4019149Z D: int, 2025-05-07T20:32:25.4019253Z scale_ub: Optional[float], 2025-05-07T20:32:25.4019391Z contiguous: bool, 2025-05-07T20:32:25.4019480Z compiled: bool, 2025-05-07T20:32:25.4019563Z ) -> None: 2025-05-07T20:32:25.4019659Z torch.manual_seed(2025) 2025-05-07T20:32:25.4019734Z 2025-05-07T20:32:25.4019910Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4019986Z 2025-05-07T20:32:25.4020081Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4020211Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4020348Z x = x_sign * x_clamp 2025-05-07T20:32:25.4020437Z x0 = x[:, :D] 2025-05-07T20:32:25.4020519Z x1 = x[:, D:] 2025-05-07T20:32:25.4020592Z 2025-05-07T20:32:25.4020679Z if contiguous: 2025-05-07T20:32:25.4020771Z x0 = x0.contiguous() 2025-05-07T20:32:25.4020862Z x1 = x1.contiguous() 2025-05-07T20:32:25.4020937Z 2025-05-07T20:32:25.4021029Z if scale_ub is not None: 2025-05-07T20:32:25.4021141Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.4021284Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.4021358Z ) 2025-05-07T20:32:25.4021434Z else: 2025-05-07T20:32:25.4021530Z scale_ub_tensor = None 2025-05-07T20:32:25.4021603Z 2025-05-07T20:32:25.4021733Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.4021824Z op = silu_mul_quant 2025-05-07T20:32:25.4021956Z if compiled: 2025-05-07T20:32:25.4022060Z op = torch.compile(op) 2025-05-07T20:32:25.4022166Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4022238Z 2025-05-07T20:32:25.4022331Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.4022336Z 2025-05-07T20:32:25.4022432Z moe/activation_test.py:117: 2025-05-07T20:32:25.4022560Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4022664Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.4022769Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4023273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:25.4023372Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.4023733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.4023952Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.4024295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.4024390Z kernel = self.compile( 2025-05-07T20:32:25.4024770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.4024943Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.4025113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4025118Z 2025-05-07T20:32:25.4025319Z self = 2025-05-07T20:32:25.4026090Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.4026589Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0509424040>} 2025-05-07T20:32:25.4027333Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.4027519Z context = 2025-05-07T20:32:25.4027527Z 2025-05-07T20:32:25.4027759Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.4028027Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.4028133Z module_map=module_map) 2025-05-07T20:32:25.4028298Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.4028395Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.4028512Z E ^ 2025-05-07T20:32:25.4028867Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.4028872Z 2025-05-07T20:32:25.4029282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.4029286Z 2025-05-07T20:32:25.4029394Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4029613Z self=, 2025-05-07T20:32:25.4029694Z T=16384, 2025-05-07T20:32:25.4029775Z D=7168, 2025-05-07T20:32:25.4029856Z scale_ub=None, 2025-05-07T20:32:25.4029937Z contiguous=True, 2025-05-07T20:32:25.4030022Z compiled=True, 2025-05-07T20:32:25.4030093Z ) 2025-05-07T20:32:25.4030306Z self = 2025-05-07T20:32:25.4030480Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.4030528Z 2025-05-07T20:32:25.4030607Z @given( 2025-05-07T20:32:25.4030726Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4030823Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4030937Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4031055Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4031169Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4031244Z ) 2025-05-07T20:32:25.4031494Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4031587Z def test_silu_mul_quant( 2025-05-07T20:32:25.4031663Z self, 2025-05-07T20:32:25.4031742Z T: int, 2025-05-07T20:32:25.4031819Z D: int, 2025-05-07T20:32:25.4031916Z scale_ub: Optional[float], 2025-05-07T20:32:25.4032007Z contiguous: bool, 2025-05-07T20:32:25.4032090Z compiled: bool, 2025-05-07T20:32:25.4032173Z ) -> None: 2025-05-07T20:32:25.4032268Z torch.manual_seed(2025) 2025-05-07T20:32:25.4032340Z 2025-05-07T20:32:25.4032510Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4032583Z 2025-05-07T20:32:25.4032675Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4032800Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4032888Z x = x_sign * x_clamp 2025-05-07T20:32:25.4032968Z x0 = x[:, :D] 2025-05-07T20:32:25.4033051Z x1 = x[:, D:] 2025-05-07T20:32:25.4033169Z 2025-05-07T20:32:25.4033256Z if contiguous: 2025-05-07T20:32:25.4033353Z x0 = x0.contiguous() 2025-05-07T20:32:25.4033444Z x1 = x1.contiguous() 2025-05-07T20:32:25.4033518Z 2025-05-07T20:32:25.4033609Z if scale_ub is not None: 2025-05-07T20:32:25.4033715Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.4033852Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.4033933Z ) 2025-05-07T20:32:25.4034008Z else: 2025-05-07T20:32:25.4034104Z scale_ub_tensor = None 2025-05-07T20:32:25.4034177Z 2025-05-07T20:32:25.4034305Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.4034397Z op = silu_mul_quant 2025-05-07T20:32:25.4034482Z if compiled: 2025-05-07T20:32:25.4034580Z op = torch.compile(op) 2025-05-07T20:32:25.4034685Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4034765Z 2025-05-07T20:32:25.4034896Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.4034905Z 2025-05-07T20:32:25.4035002Z moe/activation_test.py:117: 2025-05-07T20:32:25.4035130Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4035235Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.4035333Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4035698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.4035835Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = ...
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': ..., 'min_dot_size': ...}
module_map = {'triton.language.extra.libdevice': ...}
context = ...

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=...,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = ...
options = CUDAOptions(..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': ..., 'min_dot_size': ...}
module_map = {'triton.language.extra.libdevice': ...}
context = ...

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
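For context on what the test drives: silu_mul_quant fuses a SiLU-gated elementwise multiply with quantization to FP8. The sketch below is only a rough eager-mode illustration of that computation, not fbgemm_gpu's implementation; the row-wise dynamic scaling scheme, the torch.float8_e4m3fn output dtype, and scale_ub acting as a cap on the per-row maximum are all assumptions made for illustration.

# Rough eager-mode sketch of the computation under test -- NOT
# fbgemm_gpu's implementation. Assumptions: row-wise dynamic scaling,
# torch.float8_e4m3fn output (max finite value 448.0), and scale_ub
# bounding the per-row maximum before the scale is derived.
from typing import Optional, Tuple

import torch

FP8_E4M3_MAX = 448.0  # largest finite torch.float8_e4m3fn value

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU-gated multiply, computed in float32 for accuracy.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # The per-row absolute maximum drives the dynamic FP8 scale.
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max / FP8_E4M3_MAX
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale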
Hypothesis then tries ten more examples; every one of them fails inside _fbgemm_silu_mul_quant with the identical CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"):

Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
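The repeated failure is an architecture gate in Triton rather than a data-dependent bug: fp8e4nv (FP8 E4M3) code generation requires a GPU with compute capability 8.9 or newer, while the A10G behind a linux.g5.4xlarge runner reports 8.6, where Triton only exposes fp8e4b15 and fp8e5. A minimal guard sketch follows; supports_fp8e4nv is a hypothetical helper, and the (8, 9) threshold is an assumption inferred from this error, not an existing FBGEMM API.

# Hedged sketch of a compute-capability guard. supports_fp8e4nv is a
# hypothetical helper; the (8, 9) minimum for Triton's fp8e4nv is an
# assumption inferred from the error above (this A10G reports (8, 6)).
import unittest

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (FP8 E4M3) kernels need SM 8.9+ (Ada/Hopper-class GPUs).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

class GuardedFP8Test(unittest.TestCase):
    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9 or newer")
    def test_fp8_kernel(self) -> None:
        ...  # fp8e4nv-dependent kernel launches would go here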
Hypothesis then tried the remaining sampled combinations. Every one failed at the same point, triton/compiler/compiler.py:100: CompilationError with ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); the only variation is that the compiled=False runs have no torch/_dynamo/eval_frame.py frame, which confirms the error comes from the Triton kernel launch itself rather than from torch.compile:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False) -> CompilationError
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.4477331Z 2025-05-07T20:32:25.4477748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.4477752Z 2025-05-07T20:32:25.4477865Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4478090Z self=, 2025-05-07T20:32:25.4478171Z T=16384, 2025-05-07T20:32:25.4478257Z D=5120, 2025-05-07T20:32:25.4478349Z scale_ub=None, 2025-05-07T20:32:25.4478437Z contiguous=False, 2025-05-07T20:32:25.4478528Z compiled=False, 2025-05-07T20:32:25.4478606Z ) 2025-05-07T20:32:25.4478828Z self = 2025-05-07T20:32:25.4479014Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:25.4479018Z 2025-05-07T20:32:25.4479097Z @given( 2025-05-07T20:32:25.4479231Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4479332Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4479449Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4479575Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4479690Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4479770Z ) 2025-05-07T20:32:25.4480022Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4480163Z def test_silu_mul_quant( 2025-05-07T20:32:25.4480245Z self, 2025-05-07T20:32:25.4480333Z T: int, 2025-05-07T20:32:25.4480414Z D: int, 2025-05-07T20:32:25.4480518Z scale_ub: Optional[float], 2025-05-07T20:32:25.4480612Z contiguous: bool, 2025-05-07T20:32:25.4480704Z compiled: bool, 2025-05-07T20:32:25.4480796Z ) -> None: 2025-05-07T20:32:25.4480895Z torch.manual_seed(2025) 2025-05-07T20:32:25.4480977Z 2025-05-07T20:32:25.4481158Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4481240Z 2025-05-07T20:32:25.4481335Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4481468Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4483367Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
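The CompilationError above repeats for every Hypothesis example that actually reaches the Triton kernel: fp8e4nv is Triton's FP8 E4M3 dtype, and the supported list in the ValueError, ('fp8e4b15', 'fp8e5'), is what Triton offers on GPUs older than compute capability 8.9 (Ada/Hopper). A minimal sketch of a capability guard that would skip these tests on such hardware follows; the helper and class names are illustrative, not taken from activation_test.py.

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (FP8 E4M3) needs an Ada (SM 8.9) or Hopper (SM 9.0)
        # GPU; older parts only expose fp8e4b15 / fp8e5, which matches the
        # ValueError in the traceback above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical guard; the real test class name may differ.
    @unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 unsupported on this GPU")
    class SiluMulQuantTests(unittest.TestCase):
        ...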
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4483374Z 2025-05-07T20:32:25.4483507Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:25.4483551Z 2025-05-07T20:32:25.4483659Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4483890Z self=, 2025-05-07T20:32:25.4483970Z T=4096, 2025-05-07T20:32:25.4484054Z D=7168, 2025-05-07T20:32:25.4484151Z scale_ub=1200.0, 2025-05-07T20:32:25.4484238Z contiguous=True, 2025-05-07T20:32:25.4484325Z compiled=True, 2025-05-07T20:32:25.4484411Z ) 2025-05-07T20:32:25.4484637Z self = 2025-05-07T20:32:25.4484812Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.4484817Z 2025-05-07T20:32:25.4484903Z @given( 2025-05-07T20:32:25.4485025Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4485132Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4485250Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4485370Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4485532Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4485609Z ) 2025-05-07T20:32:25.4485859Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4485964Z def test_silu_mul_quant( 2025-05-07T20:32:25.4486047Z self, 2025-05-07T20:32:25.4486127Z T: int, 2025-05-07T20:32:25.4486215Z D: int, 2025-05-07T20:32:25.4486317Z scale_ub: Optional[float], 2025-05-07T20:32:25.4486413Z contiguous: bool, 2025-05-07T20:32:25.4486510Z compiled: bool, 2025-05-07T20:32:25.4486594Z ) -> None: 2025-05-07T20:32:25.4486699Z torch.manual_seed(2025) 2025-05-07T20:32:25.4486777Z 2025-05-07T20:32:25.4486947Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4487034Z 2025-05-07T20:32:25.4487130Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4487259Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4489154Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4489160Z 2025-05-07T20:32:25.4489285Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:25.4489289Z 2025-05-07T20:32:25.4489402Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4489626Z self=, 2025-05-07T20:32:25.4489706Z T=16384, 2025-05-07T20:32:25.4489790Z D=7168, 2025-05-07T20:32:25.4489879Z scale_ub=None, 2025-05-07T20:32:25.4489977Z contiguous=False, 2025-05-07T20:32:25.4490063Z compiled=False, 2025-05-07T20:32:25.4490140Z ) 2025-05-07T20:32:25.4490361Z self = 2025-05-07T20:32:25.4490539Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:25.4490544Z 2025-05-07T20:32:25.4490624Z @given( 2025-05-07T20:32:25.4490750Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4490900Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4491019Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4491149Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4491267Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4491354Z ) 2025-05-07T20:32:25.4491600Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4491700Z def test_silu_mul_quant( 2025-05-07T20:32:25.4491825Z self, 2025-05-07T20:32:25.4491904Z T: int, 2025-05-07T20:32:25.4491983Z D: int, 2025-05-07T20:32:25.4492093Z scale_ub: Optional[float], 2025-05-07T20:32:25.4492186Z contiguous: bool, 2025-05-07T20:32:25.4492274Z compiled: bool, 2025-05-07T20:32:25.4492363Z ) -> None: 2025-05-07T20:32:25.4492458Z torch.manual_seed(2025) 2025-05-07T20:32:25.4492534Z 2025-05-07T20:32:25.4492708Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4494497Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4494546Z 2025-05-07T20:32:25.4494673Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4494678Z 2025-05-07T20:32:25.4494782Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4495008Z self=, 2025-05-07T20:32:25.4495089Z T=2048, 2025-05-07T20:32:25.4495169Z D=7168, 2025-05-07T20:32:25.4495264Z scale_ub=1200.0, 2025-05-07T20:32:25.4495357Z contiguous=True, 2025-05-07T20:32:25.4495444Z compiled=True, 2025-05-07T20:32:25.4495529Z ) 2025-05-07T20:32:25.4495745Z self = 2025-05-07T20:32:25.4495917Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.4495929Z 2025-05-07T20:32:25.4496009Z @given( 2025-05-07T20:32:25.4496131Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4496243Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4496362Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4496480Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4496602Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4496682Z ) 2025-05-07T20:32:25.4496926Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4497072Z def test_silu_mul_quant( 2025-05-07T20:32:25.4497155Z self, 2025-05-07T20:32:25.4497235Z T: int, 2025-05-07T20:32:25.4497324Z D: int, 2025-05-07T20:32:25.4497425Z scale_ub: Optional[float], 2025-05-07T20:32:25.4497517Z contiguous: bool, 2025-05-07T20:32:25.4497610Z compiled: bool, 2025-05-07T20:32:25.4497692Z ) -> None: 2025-05-07T20:32:25.4497789Z torch.manual_seed(2025) 2025-05-07T20:32:25.4497872Z 2025-05-07T20:32:25.4498046Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4498124Z 2025-05-07T20:32:25.4498227Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4498354Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4500193Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4500200Z 2025-05-07T20:32:25.4500320Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:25.4500325Z 2025-05-07T20:32:25.4500477Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4500699Z self=, 2025-05-07T20:32:25.4500782Z T=2048, 2025-05-07T20:32:25.4500870Z D=7168, 2025-05-07T20:32:25.4500955Z scale_ub=None, 2025-05-07T20:32:25.4501045Z contiguous=True, 2025-05-07T20:32:25.4501136Z compiled=False, 2025-05-07T20:32:25.4501212Z ) 2025-05-07T20:32:25.4501428Z self = 2025-05-07T20:32:25.4501610Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.4501614Z 2025-05-07T20:32:25.4501693Z @given( 2025-05-07T20:32:25.4501819Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4501920Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4502037Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4502160Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4502320Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4502397Z ) 2025-05-07T20:32:25.4502677Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4502792Z def test_silu_mul_quant( 2025-05-07T20:32:25.4502871Z self, 2025-05-07T20:32:25.4502958Z T: int, 2025-05-07T20:32:25.4503037Z D: int, 2025-05-07T20:32:25.4503143Z scale_ub: Optional[float], 2025-05-07T20:32:25.4503234Z contiguous: bool, 2025-05-07T20:32:25.4503325Z compiled: bool, 2025-05-07T20:32:25.4503417Z ) -> None: 2025-05-07T20:32:25.4503514Z torch.manual_seed(2025) 2025-05-07T20:32:25.4503591Z 2025-05-07T20:32:25.4503765Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4503840Z 2025-05-07T20:32:25.4503935Z > x_sign = torch.sign(x) 2025-05-07T20:32:25.4505945Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
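The OutOfMemoryError is the second failure mode in this run, and the reports show PyTorch's allocated memory creeping upward across successive examples (21.60 GiB, then 21.61 GiB, then 21.67 GiB), which suggests tensors from earlier Hypothesis examples are still alive when the next one allocates. A minimal sketch of an explicit release between examples, under that assumption and with a hypothetical function name, could look like this:

    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()               # drop dead Python references first
        torch.cuda.empty_cache()   # hand cached blocks back to the driver
        torch.cuda.synchronize()   # ensure pending frees have completed

    # e.g. call release_cuda_memory() at the top of test_silu_mul_quant, or
    # from a setUp/tearDown hook, so each drawn example starts from a clean
    # allocator state.

Note that the log's own suggestion, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, only mitigates fragmentation; it would not help if the live working set itself has outgrown the 22.07 GiB card.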
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4505964Z 2025-05-07T20:32:25.4506236Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:25.4506242Z 2025-05-07T20:32:25.4506354Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4506577Z self=, 2025-05-07T20:32:25.4506664Z T=1, 2025-05-07T20:32:25.4506743Z D=7168, 2025-05-07T20:32:25.4506826Z scale_ub=1200.0, 2025-05-07T20:32:25.4506914Z contiguous=True, 2025-05-07T20:32:25.4506999Z compiled=False, 2025-05-07T20:32:25.4507077Z ) 2025-05-07T20:32:25.4507294Z self = 2025-05-07T20:32:25.4507456Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.4507461Z 2025-05-07T20:32:25.4507537Z @given( 2025-05-07T20:32:25.4507657Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4507756Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4507876Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4508056Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4508170Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4508256Z ) 2025-05-07T20:32:25.4508495Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4508587Z def test_silu_mul_quant( 2025-05-07T20:32:25.4508669Z self, 2025-05-07T20:32:25.4508745Z T: int, 2025-05-07T20:32:25.4508880Z D: int, 2025-05-07T20:32:25.4508986Z scale_ub: Optional[float], 2025-05-07T20:32:25.4509076Z contiguous: bool, 2025-05-07T20:32:25.4509162Z compiled: bool, 2025-05-07T20:32:25.4509246Z ) -> None: 2025-05-07T20:32:25.4509340Z torch.manual_seed(2025) 2025-05-07T20:32:25.4509414Z 2025-05-07T20:32:25.4509586Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4509660Z 2025-05-07T20:32:25.4509755Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4509883Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4509972Z x = x_sign * x_clamp 2025-05-07T20:32:25.4510059Z x0 = x[:, :D] 2025-05-07T20:32:25.4510142Z x1 = x[:, D:] 2025-05-07T20:32:25.4510215Z 2025-05-07T20:32:25.4510305Z if contiguous: 2025-05-07T20:32:25.4510401Z x0 = x0.contiguous() 2025-05-07T20:32:25.4510495Z x1 = x1.contiguous() 2025-05-07T20:32:25.4510580Z 2025-05-07T20:32:25.4510743Z if scale_ub is not None: 2025-05-07T20:32:25.4510850Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.4510989Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.4511066Z ) 2025-05-07T20:32:25.4511147Z else: 2025-05-07T20:32:25.4511240Z scale_ub_tensor = None 2025-05-07T20:32:25.4511314Z 2025-05-07T20:32:25.4511455Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.4511548Z op = silu_mul_quant 2025-05-07T20:32:25.4511636Z if compiled: 2025-05-07T20:32:25.4511744Z op = torch.compile(op) 2025-05-07T20:32:25.4511850Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4511922Z 2025-05-07T20:32:25.4512018Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.4512023Z 2025-05-07T20:32:25.4512117Z moe/activation_test.py:117: 2025-05-07T20:32:25.4512253Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4512361Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.4512461Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4513018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.4513115Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.4513471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.4513745Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.4514087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.4514189Z kernel = self.compile( 2025-05-07T20:32:25.4514569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.4514741Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.4514879Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4514883Z 2025-05-07T20:32:25.4515083Z self = 2025-05-07T20:32:25.4515861Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.4516397Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f050831c680>} 2025-05-07T20:32:25.4517139Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.4517370Z context = 2025-05-07T20:32:25.4517374Z 2025-05-07T20:32:25.4517537Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.4517804Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.4517911Z module_map=module_map) 2025-05-07T20:32:25.4518072Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.4518180Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.4518260Z E ^ 2025-05-07T20:32:25.4518611Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.4518622Z 2025-05-07T20:32:25.4519037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.4519042Z 2025-05-07T20:32:25.4519145Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4519419Z self=, 2025-05-07T20:32:25.4519496Z T=128, 2025-05-07T20:32:25.4519576Z D=5120, 2025-05-07T20:32:25.4519663Z scale_ub=None, 2025-05-07T20:32:25.4519749Z contiguous=True, 2025-05-07T20:32:25.4519832Z compiled=False, 2025-05-07T20:32:25.4519913Z ) 2025-05-07T20:32:25.4520128Z self = 2025-05-07T20:32:25.4520306Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.4520313Z 2025-05-07T20:32:25.4520391Z @given( 2025-05-07T20:32:25.4520510Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4520617Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4520733Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4520847Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4520966Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4521049Z ) 2025-05-07T20:32:25.4521292Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4521393Z def test_silu_mul_quant( 2025-05-07T20:32:25.4521469Z self, 2025-05-07T20:32:25.4521552Z T: int, 2025-05-07T20:32:25.4521627Z D: int, 2025-05-07T20:32:25.4521725Z scale_ub: Optional[float], 2025-05-07T20:32:25.4521820Z contiguous: bool, 2025-05-07T20:32:25.4521905Z compiled: bool, 2025-05-07T20:32:25.4522029Z ) -> None: 2025-05-07T20:32:25.4522133Z torch.manual_seed(2025) 2025-05-07T20:32:25.4522206Z 2025-05-07T20:32:25.4522374Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4522454Z 2025-05-07T20:32:25.4522547Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4522671Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4522765Z x = x_sign * x_clamp 2025-05-07T20:32:25.4522850Z x0 = x[:, :D] 2025-05-07T20:32:25.4522938Z x1 = x[:, D:] 2025-05-07T20:32:25.4523012Z 2025-05-07T20:32:25.4523095Z if contiguous: 2025-05-07T20:32:25.4523194Z x0 = x0.contiguous() 2025-05-07T20:32:25.4523283Z x1 = x1.contiguous() 2025-05-07T20:32:25.4523357Z 2025-05-07T20:32:25.4523453Z if scale_ub is not None: 2025-05-07T20:32:25.4523557Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.4523697Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.4523820Z ) 2025-05-07T20:32:25.4523900Z else: 2025-05-07T20:32:25.4523996Z scale_ub_tensor = None 2025-05-07T20:32:25.4524075Z 2025-05-07T20:32:25.4524207Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.4524305Z op = silu_mul_quant 2025-05-07T20:32:25.4524389Z if compiled: 2025-05-07T20:32:25.4524491Z op = torch.compile(op) 2025-05-07T20:32:25.4524643Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4524716Z 2025-05-07T20:32:25.4524806Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.4524810Z 2025-05-07T20:32:25.4524915Z moe/activation_test.py:117: 2025-05-07T20:32:25.4525046Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4525145Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.4525252Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4525751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.4525856Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.4526215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.4526436Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.4526779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.4526917Z kernel = self.compile( 2025-05-07T20:32:25.4527296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.4527473Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.4527670Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4527675Z 2025-05-07T20:32:25.4527893Z self = 2025-05-07T20:32:25.4528669Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.4529176Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f050831d8a0>} 2025-05-07T20:32:25.4529928Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.4530120Z context = 2025-05-07T20:32:25.4530125Z 2025-05-07T20:32:25.4530368Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.4530637Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.4530756Z module_map=module_map) 2025-05-07T20:32:25.4530922Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.4531025Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.4531112Z E ^ 2025-05-07T20:32:25.4531466Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.4531476Z 2025-05-07T20:32:25.4531888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.4531900Z 2025-05-07T20:32:25.4532007Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4532231Z self=, 2025-05-07T20:32:25.4532315Z T=128, 2025-05-07T20:32:25.4532395Z D=7168, 2025-05-07T20:32:25.4532521Z scale_ub=None, 2025-05-07T20:32:25.4532617Z contiguous=True, 2025-05-07T20:32:25.4532711Z compiled=False, 2025-05-07T20:32:25.4532789Z ) 2025-05-07T20:32:25.4533012Z self = 2025-05-07T20:32:25.4533188Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.4533193Z 2025-05-07T20:32:25.4533276Z @given( 2025-05-07T20:32:25.4533444Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4533549Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4533672Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4533792Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4533906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4533993Z ) 2025-05-07T20:32:25.4534239Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4534336Z def test_silu_mul_quant( 2025-05-07T20:32:25.4534425Z self, 2025-05-07T20:32:25.4534507Z T: int, 2025-05-07T20:32:25.4534587Z D: int, 2025-05-07T20:32:25.4534694Z scale_ub: Optional[float], 2025-05-07T20:32:25.4534787Z contiguous: bool, 2025-05-07T20:32:25.4534883Z compiled: bool, 2025-05-07T20:32:25.4534966Z ) -> None: 2025-05-07T20:32:25.4535062Z torch.manual_seed(2025) 2025-05-07T20:32:25.4535146Z 2025-05-07T20:32:25.4535363Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4535441Z 2025-05-07T20:32:25.4535542Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4535669Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4535761Z x = x_sign * x_clamp 2025-05-07T20:32:25.4535851Z x0 = x[:, :D] 2025-05-07T20:32:25.4535934Z x1 = x[:, D:] 2025-05-07T20:32:25.4536010Z 2025-05-07T20:32:25.4536101Z if contiguous: 2025-05-07T20:32:25.4536198Z x0 = x0.contiguous() 2025-05-07T20:32:25.4536293Z x1 = x1.contiguous() 2025-05-07T20:32:25.4536374Z 2025-05-07T20:32:25.4536467Z if scale_ub is not None: 2025-05-07T20:32:25.4536581Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.4536720Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.4536799Z ) 2025-05-07T20:32:25.4536883Z else: 2025-05-07T20:32:25.4536985Z scale_ub_tensor = None 2025-05-07T20:32:25.4537069Z 2025-05-07T20:32:25.4537206Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.4537299Z op = silu_mul_quant 2025-05-07T20:32:25.4537388Z if compiled: 2025-05-07T20:32:25.4537496Z op = torch.compile(op) 2025-05-07T20:32:25.4537609Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4537686Z 2025-05-07T20:32:25.4537786Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.4537791Z 2025-05-07T20:32:25.4537957Z moe/activation_test.py:117: 2025-05-07T20:32:25.4538098Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4538205Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.4538306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4538810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.4538917Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.4539274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.4539500Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.4539840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.4539943Z kernel = self.compile( 2025-05-07T20:32:25.4540366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.4540544Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.4540680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4540685Z 2025-05-07T20:32:25.4540892Z self = 2025-05-07T20:32:25.4541671Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.4542218Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f050831e7a0>} 2025-05-07T20:32:25.4542968Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.4543165Z context = 2025-05-07T20:32:25.4543170Z 2025-05-07T20:32:25.4543336Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.4543612Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.4543768Z module_map=module_map) 2025-05-07T20:32:25.4543931Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.4544041Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.4544124Z E ^ 2025-05-07T20:32:25.4544485Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.4544490Z 2025-05-07T20:32:25.4544909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.4544913Z 2025-05-07T20:32:25.4545021Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4545252Z self=, 2025-05-07T20:32:25.4545334Z T=2048, 2025-05-07T20:32:25.4545419Z D=7168, 2025-05-07T20:32:25.4545512Z scale_ub=1200.0, 2025-05-07T20:32:25.4545600Z contiguous=True, 2025-05-07T20:32:25.4545692Z compiled=False, 2025-05-07T20:32:25.4545775Z ) 2025-05-07T20:32:25.4545995Z self = 2025-05-07T20:32:25.4546177Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.4546181Z 2025-05-07T20:32:25.4546261Z @given( 2025-05-07T20:32:25.4546384Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4546492Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4546653Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4546780Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4546901Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4546979Z ) 2025-05-07T20:32:25.4547232Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4547330Z def test_silu_mul_quant( 2025-05-07T20:32:25.4547410Z self, 2025-05-07T20:32:25.4547497Z T: int, 2025-05-07T20:32:25.4547581Z D: int, 2025-05-07T20:32:25.4547682Z scale_ub: Optional[float], 2025-05-07T20:32:25.4547784Z contiguous: bool, 2025-05-07T20:32:25.4547871Z compiled: bool, 2025-05-07T20:32:25.4547955Z ) -> None: 2025-05-07T20:32:25.4548060Z torch.manual_seed(2025) 2025-05-07T20:32:25.4548138Z 2025-05-07T20:32:25.4548317Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4550135Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4550179Z 2025-05-07T20:32:25.4550310Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4550314Z 2025-05-07T20:32:25.4550421Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4550650Z self=, 2025-05-07T20:32:25.4550729Z T=1, 2025-05-07T20:32:25.4550812Z D=5120, 2025-05-07T20:32:25.4550906Z scale_ub=1200.0, 2025-05-07T20:32:25.4550996Z contiguous=True, 2025-05-07T20:32:25.4551087Z compiled=False, 2025-05-07T20:32:25.4551173Z ) 2025-05-07T20:32:25.4551392Z self = 2025-05-07T20:32:25.4551558Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.4551563Z 2025-05-07T20:32:25.4551651Z @given( 2025-05-07T20:32:25.4551772Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4551880Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4552042Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4552161Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4552285Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4552362Z ) 2025-05-07T20:32:25.4552609Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4552715Z def test_silu_mul_quant( 2025-05-07T20:32:25.4552794Z self, 2025-05-07T20:32:25.4552875Z T: int, 2025-05-07T20:32:25.4552963Z D: int, 2025-05-07T20:32:25.4553068Z scale_ub: Optional[float], 2025-05-07T20:32:25.4553164Z contiguous: bool, 2025-05-07T20:32:25.4553253Z compiled: bool, 2025-05-07T20:32:25.4553335Z ) -> None: 2025-05-07T20:32:25.4553439Z torch.manual_seed(2025) 2025-05-07T20:32:25.4553516Z 2025-05-07T20:32:25.4553685Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4553772Z 2025-05-07T20:32:25.4553872Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4554002Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4554094Z x = x_sign * x_clamp 2025-05-07T20:32:25.4554179Z x0 = x[:, :D] 2025-05-07T20:32:25.4554263Z x1 = x[:, D:] 2025-05-07T20:32:25.4554352Z 2025-05-07T20:32:25.4554441Z if contiguous: 2025-05-07T20:32:25.4554537Z x0 = x0.contiguous() 2025-05-07T20:32:25.4554635Z x1 = x1.contiguous() 2025-05-07T20:32:25.4554714Z 2025-05-07T20:32:25.4554856Z if scale_ub is not None: 2025-05-07T20:32:25.4554974Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.4555114Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.4555199Z ) 2025-05-07T20:32:25.4555278Z else: 2025-05-07T20:32:25.4555377Z scale_ub_tensor = None 2025-05-07T20:32:25.4555462Z 2025-05-07T20:32:25.4555593Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.4555691Z op = silu_mul_quant 2025-05-07T20:32:25.4555785Z if compiled: 2025-05-07T20:32:25.4555889Z op = torch.compile(op) 2025-05-07T20:32:25.4555998Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4556081Z 2025-05-07T20:32:25.4556175Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.4556180Z 2025-05-07T20:32:25.4556291Z moe/activation_test.py:117: 2025-05-07T20:32:25.4556433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4556649Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.4556768Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4557269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.4557372Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.4557736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.4558002Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.4558350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.4558449Z kernel = self.compile( 2025-05-07T20:32:25.4558831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.4559021Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.4559154Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4559159Z 2025-05-07T20:32:25.4559365Z self = 2025-05-07T20:32:25.4560150Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.4560723Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f050831fb00>} 2025-05-07T20:32:25.4566731Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.4566956Z context = 2025-05-07T20:32:25.4566962Z 2025-05-07T20:32:25.4567130Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.4567398Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.4567508Z module_map=module_map) 2025-05-07T20:32:25.4567755Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.4567864Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.4567942Z E ^ 2025-05-07T20:32:25.4568298Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.4568303Z 2025-05-07T20:32:25.4568720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.4568725Z 2025-05-07T20:32:25.4568899Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4569127Z self=, 2025-05-07T20:32:25.4569210Z T=2048, 2025-05-07T20:32:25.4569294Z D=5120, 2025-05-07T20:32:25.4569379Z scale_ub=None, 2025-05-07T20:32:25.4569468Z contiguous=True, 2025-05-07T20:32:25.4569560Z compiled=False, 2025-05-07T20:32:25.4569635Z ) 2025-05-07T20:32:25.4569855Z self = 2025-05-07T20:32:25.4570041Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.4570046Z 2025-05-07T20:32:25.4570124Z @given( 2025-05-07T20:32:25.4570256Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4570359Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4570476Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4570601Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4570720Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4570840Z ) 2025-05-07T20:32:25.4571093Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4571188Z def test_silu_mul_quant( 2025-05-07T20:32:25.4571266Z self, 2025-05-07T20:32:25.4571349Z T: int, 2025-05-07T20:32:25.4571428Z D: int, 2025-05-07T20:32:25.4571532Z scale_ub: Optional[float], 2025-05-07T20:32:25.4571628Z contiguous: bool, 2025-05-07T20:32:25.4571757Z compiled: bool, 2025-05-07T20:32:25.4571841Z ) -> None: 2025-05-07T20:32:25.4571938Z torch.manual_seed(2025) 2025-05-07T20:32:25.4572012Z 2025-05-07T20:32:25.4572186Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4572261Z 2025-05-07T20:32:25.4572359Z > x_sign = torch.sign(x) 2025-05-07T20:32:25.4574153Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4574201Z 2025-05-07T20:32:25.4574324Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:25.4574329Z 2025-05-07T20:32:25.4574436Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4574657Z self=, 2025-05-07T20:32:25.4574740Z T=16384, 2025-05-07T20:32:25.4574819Z D=5120, 2025-05-07T20:32:25.4574904Z scale_ub=None, 2025-05-07T20:32:25.4574993Z contiguous=True, 2025-05-07T20:32:25.4575079Z compiled=False, 2025-05-07T20:32:25.4575158Z ) 2025-05-07T20:32:25.4575383Z self = 2025-05-07T20:32:25.4575560Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.4575565Z 2025-05-07T20:32:25.4575643Z @given( 2025-05-07T20:32:25.4575767Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4575867Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4575985Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4576110Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4576224Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4576302Z ) 2025-05-07T20:32:25.4576548Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4576648Z def test_silu_mul_quant( 2025-05-07T20:32:25.4576726Z self, 2025-05-07T20:32:25.4576803Z T: int, 2025-05-07T20:32:25.4576883Z D: int, 2025-05-07T20:32:25.4577027Z scale_ub: Optional[float], 2025-05-07T20:32:25.4577118Z contiguous: bool, 2025-05-07T20:32:25.4577207Z compiled: bool, 2025-05-07T20:32:25.4577286Z ) -> None: 2025-05-07T20:32:25.4577382Z torch.manual_seed(2025) 2025-05-07T20:32:25.4577463Z 2025-05-07T20:32:25.4577629Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4579403Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4579417Z 2025-05-07T20:32:25.4579574Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4579579Z 2025-05-07T20:32:25.4579694Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4579915Z self=, 2025-05-07T20:32:25.4579994Z T=4096, 2025-05-07T20:32:25.4580075Z D=5120, 2025-05-07T20:32:25.4580158Z scale_ub=None, 2025-05-07T20:32:25.4580244Z contiguous=True, 2025-05-07T20:32:25.4580376Z compiled=False, 2025-05-07T20:32:25.4580452Z ) 2025-05-07T20:32:25.4580666Z self = 2025-05-07T20:32:25.4580839Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.4580844Z 2025-05-07T20:32:25.4580924Z @given( 2025-05-07T20:32:25.4581046Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4581145Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4581267Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4581390Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4581504Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4581580Z ) 2025-05-07T20:32:25.4581826Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4581920Z def test_silu_mul_quant( 2025-05-07T20:32:25.4581997Z self, 2025-05-07T20:32:25.4582125Z T: int, 2025-05-07T20:32:25.4582203Z D: int, 2025-05-07T20:32:25.4582308Z scale_ub: Optional[float], 2025-05-07T20:32:25.4582400Z contiguous: bool, 2025-05-07T20:32:25.4582488Z compiled: bool, 2025-05-07T20:32:25.4582570Z ) -> None: 2025-05-07T20:32:25.4582665Z torch.manual_seed(2025) 2025-05-07T20:32:25.4582741Z 2025-05-07T20:32:25.4582914Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4584678Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4584688Z 2025-05-07T20:32:25.4584812Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4584816Z 2025-05-07T20:32:25.4584920Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4585140Z self=, 2025-05-07T20:32:25.4585220Z T=2048, 2025-05-07T20:32:25.4585297Z D=5120, 2025-05-07T20:32:25.4585386Z scale_ub=None, 2025-05-07T20:32:25.4585519Z contiguous=False, 2025-05-07T20:32:25.4585608Z compiled=False, 2025-05-07T20:32:25.4585689Z ) 2025-05-07T20:32:25.4585906Z self = 2025-05-07T20:32:25.4586077Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:25.4586082Z 2025-05-07T20:32:25.4586168Z @given( 2025-05-07T20:32:25.4586292Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4586397Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4586522Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4586642Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4586759Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4586834Z ) 2025-05-07T20:32:25.4587080Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4587184Z def test_silu_mul_quant( 2025-05-07T20:32:25.4587263Z self, 2025-05-07T20:32:25.4587344Z T: int, 2025-05-07T20:32:25.4587470Z D: int, 2025-05-07T20:32:25.4587573Z scale_ub: Optional[float], 2025-05-07T20:32:25.4587664Z contiguous: bool, 2025-05-07T20:32:25.4587757Z compiled: bool, 2025-05-07T20:32:25.4587836Z ) -> None: 2025-05-07T20:32:25.4587932Z torch.manual_seed(2025) 2025-05-07T20:32:25.4588010Z 2025-05-07T20:32:25.4588178Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4589986Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4589992Z 2025-05-07T20:32:25.4590113Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4590118Z 2025-05-07T20:32:25.4590228Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4590448Z self=, 2025-05-07T20:32:25.4590526Z T=4096, 2025-05-07T20:32:25.4590605Z D=7168, 2025-05-07T20:32:25.4590731Z scale_ub=None, 2025-05-07T20:32:25.4590820Z contiguous=True, 2025-05-07T20:32:25.4590909Z compiled=True, 2025-05-07T20:32:25.4590986Z ) 2025-05-07T20:32:25.4591203Z self = 2025-05-07T20:32:25.4591375Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.4591381Z 2025-05-07T20:32:25.4591459Z @given( 2025-05-07T20:32:25.4591584Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4591688Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4591805Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4591924Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4592037Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4592114Z ) 2025-05-07T20:32:25.4592364Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4592460Z def test_silu_mul_quant( 2025-05-07T20:32:25.4592546Z self, 2025-05-07T20:32:25.4592633Z T: int, 2025-05-07T20:32:25.4592711Z D: int, 2025-05-07T20:32:25.4592814Z scale_ub: Optional[float], 2025-05-07T20:32:25.4592905Z contiguous: bool, 2025-05-07T20:32:25.4592994Z compiled: bool, 2025-05-07T20:32:25.4593081Z ) -> None: 2025-05-07T20:32:25.4593178Z torch.manual_seed(2025) 2025-05-07T20:32:25.4593254Z 2025-05-07T20:32:25.4593475Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4595245Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
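For reference, the test body repeated above only fixes the operator's contract: split x into x0 and x1, apply SiLU to x0, multiply by x1, and quantize the product to FP8 with a per-row scale, optionally clamped by scale_ub. A hedged eager-mode sketch of that contract, assuming a rowwise E4M3 recipe (FBGEMM's actual kernel may compute its scales differently), is:

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_reference(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32 for a stable reference.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
        scale = row_max / fp8_max
        y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(1)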
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4595256Z 2025-05-07T20:32:25.4595379Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4595384Z 2025-05-07T20:32:25.4595492Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4595714Z self=, 2025-05-07T20:32:25.4595800Z T=2048, 2025-05-07T20:32:25.4595880Z D=5120, 2025-05-07T20:32:25.4595974Z scale_ub=1200.0, 2025-05-07T20:32:25.4596109Z contiguous=False, 2025-05-07T20:32:25.4596198Z compiled=False, 2025-05-07T20:32:25.4596281Z ) 2025-05-07T20:32:25.4596497Z self = 2025-05-07T20:32:25.4596675Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:25.4596679Z 2025-05-07T20:32:25.4596765Z @given( 2025-05-07T20:32:25.4596951Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4597054Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4597176Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4597297Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4597416Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4597491Z ) 2025-05-07T20:32:25.4597737Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4597837Z def test_silu_mul_quant( 2025-05-07T20:32:25.4597918Z self, 2025-05-07T20:32:25.4597997Z T: int, 2025-05-07T20:32:25.4598083Z D: int, 2025-05-07T20:32:25.4598183Z scale_ub: Optional[float], 2025-05-07T20:32:25.4598276Z contiguous: bool, 2025-05-07T20:32:25.4598370Z compiled: bool, 2025-05-07T20:32:25.4598450Z ) -> None: 2025-05-07T20:32:25.4598546Z torch.manual_seed(2025) 2025-05-07T20:32:25.4598625Z 2025-05-07T20:32:25.4598837Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4600612Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4600618Z 2025-05-07T20:32:25.4600738Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4600742Z 2025-05-07T20:32:25.4600852Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4601073Z self=, 2025-05-07T20:32:25.4601160Z T=4096, 2025-05-07T20:32:25.4601247Z D=7168, 2025-05-07T20:32:25.4601336Z scale_ub=1200.0, 2025-05-07T20:32:25.4601427Z contiguous=True, 2025-05-07T20:32:25.4601522Z compiled=False, 2025-05-07T20:32:25.4601599Z ) 2025-05-07T20:32:25.4601812Z self = 2025-05-07T20:32:25.4601989Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.4601994Z 2025-05-07T20:32:25.4602073Z @given( 2025-05-07T20:32:25.4602245Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4602347Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4602462Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4602583Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4602698Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4602776Z ) 2025-05-07T20:32:25.4603022Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4603125Z def test_silu_mul_quant( 2025-05-07T20:32:25.4603204Z self, 2025-05-07T20:32:25.4603289Z T: int, 2025-05-07T20:32:25.4603371Z D: int, 2025-05-07T20:32:25.4603481Z scale_ub: Optional[float], 2025-05-07T20:32:25.4603575Z contiguous: bool, 2025-05-07T20:32:25.4603664Z compiled: bool, 2025-05-07T20:32:25.4603749Z ) -> None: 2025-05-07T20:32:25.4603845Z torch.manual_seed(2025) 2025-05-07T20:32:25.4603919Z 2025-05-07T20:32:25.4604133Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4606260Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4606358Z 2025-05-07T20:32:25.4606488Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4606493Z 2025-05-07T20:32:25.4606597Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4606816Z self=, 2025-05-07T20:32:25.4606904Z T=16384, 2025-05-07T20:32:25.4606985Z D=7168, 2025-05-07T20:32:25.4607069Z scale_ub=None, 2025-05-07T20:32:25.4607157Z contiguous=False, 2025-05-07T20:32:25.4607240Z compiled=True, 2025-05-07T20:32:25.4607316Z ) 2025-05-07T20:32:25.4607585Z self = 2025-05-07T20:32:25.4607759Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:25.4607838Z 2025-05-07T20:32:25.4607927Z @given( 2025-05-07T20:32:25.4608043Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4608140Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4608258Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4608374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4608488Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4608563Z ) 2025-05-07T20:32:25.4608807Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4608905Z def test_silu_mul_quant( 2025-05-07T20:32:25.4608984Z self, 2025-05-07T20:32:25.4609065Z T: int, 2025-05-07T20:32:25.4609147Z D: int, 2025-05-07T20:32:25.4609249Z scale_ub: Optional[float], 2025-05-07T20:32:25.4609338Z contiguous: bool, 2025-05-07T20:32:25.4609434Z compiled: bool, 2025-05-07T20:32:25.4609514Z ) -> None: 2025-05-07T20:32:25.4609608Z torch.manual_seed(2025) 2025-05-07T20:32:25.4609689Z 2025-05-07T20:32:25.4609854Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4611693Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4611700Z 2025-05-07T20:32:25.4611819Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4611830Z 2025-05-07T20:32:25.4611930Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4612148Z self=, 2025-05-07T20:32:25.4612236Z T=4096, 2025-05-07T20:32:25.4612315Z D=7168, 2025-05-07T20:32:25.4612399Z scale_ub=None, 2025-05-07T20:32:25.4612488Z contiguous=True, 2025-05-07T20:32:25.4612573Z compiled=False, 2025-05-07T20:32:25.4612647Z ) 2025-05-07T20:32:25.4612866Z self = 2025-05-07T20:32:25.4613034Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.4613042Z 2025-05-07T20:32:25.4613179Z @given( 2025-05-07T20:32:25.4613299Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4613398Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4613514Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4613626Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4613738Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4613856Z ) 2025-05-07T20:32:25.4614101Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4614193Z def test_silu_mul_quant( 2025-05-07T20:32:25.4614278Z self, 2025-05-07T20:32:25.4614354Z T: int, 2025-05-07T20:32:25.4614431Z D: int, 2025-05-07T20:32:25.4614532Z scale_ub: Optional[float], 2025-05-07T20:32:25.4614621Z contiguous: bool, 2025-05-07T20:32:25.4614709Z compiled: bool, 2025-05-07T20:32:25.4614789Z ) -> None: 2025-05-07T20:32:25.4614883Z torch.manual_seed(2025) 2025-05-07T20:32:25.4614965Z 2025-05-07T20:32:25.4615130Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4616899Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
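The @given/@settings pair repeated in these tracebacks runs under the Hypothesis 'ci' profile reported in the session header further down this log (database=None, deadline=None, print_blob=True, derandomize=True, too_slow suppressed). A minimal sketch of how such a profile can be registered, with the values read off this log rather than from FBGEMM's sources; where it is registered (e.g. a conftest.py) is an assumption:

# Sketch: registering a Hypothesis profile matching the 'ci' profile this run loads.
from hypothesis import HealthCheck, settings

settings.register_profile(
    "ci",
    database=None,                                  # no example database on CI
    deadline=None,                                  # GPU kernel latency is too noisy for deadlines
    print_blob=True,                                # print reproduction blobs on failure
    derandomize=True,                               # deterministic example order across runs
    suppress_health_check=(HealthCheck.too_slow,),
)
settings.load_profile("ci")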
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4616954Z 2025-05-07T20:32:25.4617071Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4617076Z 2025-05-07T20:32:25.4617178Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4617405Z self=, 2025-05-07T20:32:25.4617482Z T=16384, 2025-05-07T20:32:25.4617561Z D=7168, 2025-05-07T20:32:25.4617648Z scale_ub=None, 2025-05-07T20:32:25.4617731Z contiguous=True, 2025-05-07T20:32:25.4617820Z compiled=False, 2025-05-07T20:32:25.4617893Z ) 2025-05-07T20:32:25.4618104Z self = 2025-05-07T20:32:25.4618278Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.4618288Z 2025-05-07T20:32:25.4618365Z @given( 2025-05-07T20:32:25.4618480Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4618581Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4618692Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4618809Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4618923Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4619038Z ) 2025-05-07T20:32:25.4619285Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4619378Z def test_silu_mul_quant( 2025-05-07T20:32:25.4619454Z self, 2025-05-07T20:32:25.4619534Z T: int, 2025-05-07T20:32:25.4619610Z D: int, 2025-05-07T20:32:25.4619707Z scale_ub: Optional[float], 2025-05-07T20:32:25.4619796Z contiguous: bool, 2025-05-07T20:32:25.4619879Z compiled: bool, 2025-05-07T20:32:25.4619964Z ) -> None: 2025-05-07T20:32:25.4620060Z torch.manual_seed(2025) 2025-05-07T20:32:25.4620134Z 2025-05-07T20:32:25.4620300Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4622116Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4622122Z 2025-05-07T20:32:25.4622237Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4622244Z 2025-05-07T20:32:25.4622344Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4622602Z self=, 2025-05-07T20:32:25.4622684Z T=16384, 2025-05-07T20:32:25.4622761Z D=7168, 2025-05-07T20:32:25.4622843Z scale_ub=1200.0, 2025-05-07T20:32:25.4622932Z contiguous=True, 2025-05-07T20:32:25.4623016Z compiled=False, 2025-05-07T20:32:25.4623093Z ) 2025-05-07T20:32:25.4623313Z self = 2025-05-07T20:32:25.4623490Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.4623495Z 2025-05-07T20:32:25.4623575Z @given( 2025-05-07T20:32:25.4623690Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4623786Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4623901Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4624015Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4624127Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4624249Z ) 2025-05-07T20:32:25.4624489Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4624581Z def test_silu_mul_quant( 2025-05-07T20:32:25.4624665Z self, 2025-05-07T20:32:25.4624743Z T: int, 2025-05-07T20:32:25.4624820Z D: int, 2025-05-07T20:32:25.4624921Z scale_ub: Optional[float], 2025-05-07T20:32:25.4625010Z contiguous: bool, 2025-05-07T20:32:25.4625099Z compiled: bool, 2025-05-07T20:32:25.4625181Z ) -> None: 2025-05-07T20:32:25.4625274Z torch.manual_seed(2025) 2025-05-07T20:32:25.4625350Z 2025-05-07T20:32:25.4625515Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4627280Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
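Each of these messages ends with the allocator's own remedy. For expandable_segments to take effect, PYTORCH_CUDA_ALLOC_CONF must be set before the process first initializes CUDA; a sketch, where the tensor shape is just the failing example's:

# Sketch: apply the allocator's suggestion from the OOM messages above.
# The variable must be set before torch initializes CUDA in this process.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after setting the env var

x = torch.randn([16384, 2 * 7168], device="cuda", dtype=torch.bfloat16)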
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4627294Z 2025-05-07T20:32:25.4627410Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4627415Z 2025-05-07T20:32:25.4627556Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4627782Z self=, 2025-05-07T20:32:25.4627858Z T=128, 2025-05-07T20:32:25.4627938Z D=5120, 2025-05-07T20:32:25.4628024Z scale_ub=1200.0, 2025-05-07T20:32:25.4628109Z contiguous=False, 2025-05-07T20:32:25.4628195Z compiled=False, 2025-05-07T20:32:25.4628268Z ) 2025-05-07T20:32:25.4628481Z self = 2025-05-07T20:32:25.4628657Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:25.4628661Z 2025-05-07T20:32:25.4628740Z @given( 2025-05-07T20:32:25.4628856Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4628956Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4629067Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4629180Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4629366Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4629443Z ) 2025-05-07T20:32:25.4629688Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4629780Z def test_silu_mul_quant( 2025-05-07T20:32:25.4629856Z self, 2025-05-07T20:32:25.4629937Z T: int, 2025-05-07T20:32:25.4630014Z D: int, 2025-05-07T20:32:25.4630109Z scale_ub: Optional[float], 2025-05-07T20:32:25.4630244Z contiguous: bool, 2025-05-07T20:32:25.4630331Z compiled: bool, 2025-05-07T20:32:25.4630408Z ) -> None: 2025-05-07T20:32:25.4630504Z torch.manual_seed(2025) 2025-05-07T20:32:25.4630576Z 2025-05-07T20:32:25.4630739Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4630819Z 2025-05-07T20:32:25.4630911Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4631040Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4631130Z x = x_sign * x_clamp 2025-05-07T20:32:25.4631212Z x0 = x[:, :D] 2025-05-07T20:32:25.4631296Z x1 = x[:, D:] 2025-05-07T20:32:25.4631369Z 2025-05-07T20:32:25.4631450Z if contiguous: 2025-05-07T20:32:25.4631544Z x0 = x0.contiguous() 2025-05-07T20:32:25.4631636Z x1 = x1.contiguous() 2025-05-07T20:32:25.4631709Z 2025-05-07T20:32:25.4631805Z if scale_ub is not None: 2025-05-07T20:32:25.4631911Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.4632091Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.4632173Z ) 2025-05-07T20:32:25.4632249Z else: 2025-05-07T20:32:25.4632342Z scale_ub_tensor = None 2025-05-07T20:32:25.4632438Z 2025-05-07T20:32:25.4632583Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.4632691Z op = silu_mul_quant 2025-05-07T20:32:25.4632777Z if compiled: 2025-05-07T20:32:25.4632880Z op = torch.compile(op) 2025-05-07T20:32:25.4632991Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4633064Z 2025-05-07T20:32:25.4633154Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.4633159Z 2025-05-07T20:32:25.4633259Z moe/activation_test.py:117: 2025-05-07T20:32:25.4633387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4633486Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.4633592Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4634092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.4634192Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:25.4634546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:32:25.4634768Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:25.4635160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:25.4635260Z kernel = self.compile(
2025-05-07T20:32:25.4635644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:25.4635820Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:25.4635948Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:25.4635957Z
2025-05-07T20:32:25.4636168Z self =
2025-05-07T20:32:25.4636942Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:25.4637488Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f005df3e700>}
2025-05-07T20:32:25.4638236Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:25.4638425Z context =
2025-05-07T20:32:25.4638470Z
2025-05-07T20:32:25.4638639Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:25.4638901Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:25.4639014Z module_map=module_map)
2025-05-07T20:32:25.4639177Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.4639277Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.4639363Z E ^
2025-05-07T20:32:25.4639723Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.4639728Z
2025-05-07T20:32:25.4641557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.4641566Z
2025-05-07T20:32:25.4641669Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.4641889Z self=,
2025-05-07T20:32:25.4642016Z T=2048,
2025-05-07T20:32:25.4642095Z D=7168,
2025-05-07T20:32:25.4642178Z scale_ub=None,
2025-05-07T20:32:25.4642271Z contiguous=False,
2025-05-07T20:32:25.4642357Z compiled=False,
2025-05-07T20:32:25.4642436Z )
2025-05-07T20:32:25.4642656Z self =
2025-05-07T20:32:25.4642828Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:25.4642833Z
2025-05-07T20:32:25.4642919Z @given(
2025-05-07T20:32:25.4643041Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:25.4643141Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:25.4643261Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:25.4643379Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:25.4643493Z compiled=st.sampled_from([True, False]),
2025-05-07T20:32:25.4643575Z )
2025-05-07T20:32:25.4643820Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:25.4643921Z def test_silu_mul_quant(
2025-05-07T20:32:25.4644004Z self,
2025-05-07T20:32:25.4644083Z T: int,
2025-05-07T20:32:25.4644162Z D: int,
2025-05-07T20:32:25.4644265Z scale_ub: Optional[float],
2025-05-07T20:32:25.4644357Z contiguous: bool,
2025-05-07T20:32:25.4644449Z compiled: bool,
2025-05-07T20:32:25.4644528Z ) -> None:
2025-05-07T20:32:25.4644624Z torch.manual_seed(2025)
2025-05-07T20:32:25.4644746Z
2025-05-07T20:32:25.4644918Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:25.4646684Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
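The CompilationError above comes from Triton lowering _fbgemm_silu_mul_quant: fp8e4nv is Triton's e4m3 float8 type, and per the error text this GPU's architecture only offers fp8e4b15 and fp8e5. One way to keep such tests green on runners like this one is to gate fp8 tests on compute capability; a sketch, where the (8, 9) cutoff (sm_89) is an assumption inferred from the error rather than taken from FBGEMM's own skip logic, and the class name is reused from this log for illustration:

# Sketch: skip fp8 tests on GPUs where Triton's fp8e4nv (e4m3) is unavailable.
# The sm_89 cutoff is an assumption based on the error message above.
import unittest
import torch

def supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8e4nv(), "Triton fp8e4nv unsupported on this GPU")
class ActivationTests(unittest.TestCase):
    ...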
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4646700Z 2025-05-07T20:32:25.4646819Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4646823Z 2025-05-07T20:32:25.4646927Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4647153Z self=, 2025-05-07T20:32:25.4647271Z T=128, 2025-05-07T20:32:25.4647351Z D=7168, 2025-05-07T20:32:25.4647442Z scale_ub=1200.0, 2025-05-07T20:32:25.4647575Z contiguous=True, 2025-05-07T20:32:25.4647663Z compiled=True, 2025-05-07T20:32:25.4647738Z ) 2025-05-07T20:32:25.4647952Z self = 2025-05-07T20:32:25.4648124Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.4648174Z 2025-05-07T20:32:25.4648254Z @given( 2025-05-07T20:32:25.4648370Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4648472Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4648584Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4648697Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4648812Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4648889Z ) 2025-05-07T20:32:25.4649141Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4649234Z def test_silu_mul_quant( 2025-05-07T20:32:25.4649312Z self, 2025-05-07T20:32:25.4649392Z T: int, 2025-05-07T20:32:25.4649469Z D: int, 2025-05-07T20:32:25.4649566Z scale_ub: Optional[float], 2025-05-07T20:32:25.4649658Z contiguous: bool, 2025-05-07T20:32:25.4649744Z compiled: bool, 2025-05-07T20:32:25.4649821Z ) -> None: 2025-05-07T20:32:25.4649970Z torch.manual_seed(2025) 2025-05-07T20:32:25.4650044Z 2025-05-07T20:32:25.4650210Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4650288Z 2025-05-07T20:32:25.4650377Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4650506Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4650593Z x = x_sign * x_clamp 2025-05-07T20:32:25.4650673Z x0 = x[:, :D] 2025-05-07T20:32:25.4650754Z x1 = x[:, D:] 2025-05-07T20:32:25.4650830Z 2025-05-07T20:32:25.4650914Z if contiguous: 2025-05-07T20:32:25.4651008Z x0 = x0.contiguous() 2025-05-07T20:32:25.4651096Z x1 = x1.contiguous() 2025-05-07T20:32:25.4651170Z 2025-05-07T20:32:25.4651262Z if scale_ub is not None: 2025-05-07T20:32:25.4651368Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.4651501Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.4651589Z ) 2025-05-07T20:32:25.4651667Z else: 2025-05-07T20:32:25.4651761Z scale_ub_tensor = None 2025-05-07T20:32:25.4651834Z 2025-05-07T20:32:25.4651961Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.4652053Z op = silu_mul_quant 2025-05-07T20:32:25.4652137Z if compiled: 2025-05-07T20:32:25.4652238Z op = torch.compile(op) 2025-05-07T20:32:25.4652346Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4652420Z 2025-05-07T20:32:25.4652559Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.4652564Z 2025-05-07T20:32:25.4652663Z moe/activation_test.py:117: 2025-05-07T20:32:25.4652792Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4652895Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.4652991Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4653356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.4653459Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.4653950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.4654046Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.4654405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.4654669Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.4655010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.4655103Z kernel = self.compile( 2025-05-07T20:32:25.4655483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.4655662Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.4655832Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4655837Z 2025-05-07T20:32:25.4656039Z self = 2025-05-07T20:32:25.4656815Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.4657315Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f005df3ff60>} 2025-05-07T20:32:25.4658062Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.4658249Z context = 2025-05-07T20:32:25.4658297Z 2025-05-07T20:32:25.4658466Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.4658727Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.4658832Z module_map=module_map) 2025-05-07T20:32:25.4658995Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.4659092Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.4659173Z E ^ 2025-05-07T20:32:25.4659531Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.4659536Z 2025-05-07T20:32:25.4659946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.4659951Z 2025-05-07T20:32:25.4660059Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4660282Z self=, 2025-05-07T20:32:25.4660360Z T=128, 2025-05-07T20:32:25.4660442Z D=7168, 2025-05-07T20:32:25.4660525Z scale_ub=1200.0, 2025-05-07T20:32:25.4660609Z contiguous=True, 2025-05-07T20:32:25.4660696Z compiled=False, 2025-05-07T20:32:25.4660769Z ) 2025-05-07T20:32:25.4660984Z self = 2025-05-07T20:32:25.4661217Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.4661223Z 2025-05-07T20:32:25.4661302Z @given( 2025-05-07T20:32:25.4661423Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4661522Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4661634Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4661751Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4661862Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4661944Z ) 2025-05-07T20:32:25.4662190Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4662284Z def test_silu_mul_quant( 2025-05-07T20:32:25.4662364Z self, 2025-05-07T20:32:25.4662440Z T: int, 2025-05-07T20:32:25.4662517Z D: int, 2025-05-07T20:32:25.4662616Z scale_ub: Optional[float], 2025-05-07T20:32:25.4662704Z contiguous: bool, 2025-05-07T20:32:25.4662794Z compiled: bool, 2025-05-07T20:32:25.4662876Z ) -> None: 2025-05-07T20:32:25.4663011Z torch.manual_seed(2025) 2025-05-07T20:32:25.4663086Z 2025-05-07T20:32:25.4663251Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4663324Z 2025-05-07T20:32:25.4663416Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4663538Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4665305Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
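silu_mul_quant fuses SiLU(x0) * x1 with a row-wise fp8 quantization, which is what the test's reference path (ref_fn, shown later in this log) recomputes in eager mode. When the Triton kernel cannot compile, an eager stand-in is possible; a sketch, assuming the op targets torch.float8_e4m3fn with a finite max of 448.0, neither of which is confirmed by this log:

# Sketch: eager equivalent of SiLU(x0) * x1 followed by row-wise fp8 quantization.
# torch.float8_e4m3fn and the 448.0 max are assumptions about the kernel's target.
from typing import Optional, Tuple
import torch

FP8_MAX = 448.0  # assumed finite max of float8_e4m3fn

def silu_mul_quant_eager(
    x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)   # per-row absolute maximum
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)   # clamp rows to the upper bound
    y_scale = row_max / FP8_MAX                      # per-row dequantization scale
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale

Dequantizing as y_fp8.to(torch.float32) * y_scale[:, None] matches how the test compares fn() against ref_fn().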
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4665352Z 2025-05-07T20:32:25.4665475Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:25.4665480Z 2025-05-07T20:32:25.4665584Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4665802Z self=, 2025-05-07T20:32:25.4665881Z T=128, 2025-05-07T20:32:25.4665960Z D=5120, 2025-05-07T20:32:25.4666042Z scale_ub=1200.0, 2025-05-07T20:32:25.4666127Z contiguous=True, 2025-05-07T20:32:25.4666218Z compiled=True, 2025-05-07T20:32:25.4666336Z ) 2025-05-07T20:32:25.4666549Z self = 2025-05-07T20:32:25.4666718Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.4666722Z 2025-05-07T20:32:25.4666798Z @given( 2025-05-07T20:32:25.4666916Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4667013Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4667127Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4667248Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4667356Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4667430Z ) 2025-05-07T20:32:25.4667674Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4667766Z def test_silu_mul_quant( 2025-05-07T20:32:25.4667842Z self, 2025-05-07T20:32:25.4667921Z T: int, 2025-05-07T20:32:25.4668003Z D: int, 2025-05-07T20:32:25.4668102Z scale_ub: Optional[float], 2025-05-07T20:32:25.4668188Z contiguous: bool, 2025-05-07T20:32:25.4668272Z compiled: bool, 2025-05-07T20:32:25.4668351Z ) -> None: 2025-05-07T20:32:25.4668444Z torch.manual_seed(2025) 2025-05-07T20:32:25.4668517Z 2025-05-07T20:32:25.4668683Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4668757Z 2025-05-07T20:32:25.4668847Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4669018Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4670772Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
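By this point even 20.00 MiB requests fail with only single-digit MiB free: tensors from earlier examples and state held by torch.compile are still occupying the 22 GiB device. Releasing that state between examples is one mitigation; a sketch of a teardown helper that is not part of the test file:

# Sketch: teardown helper to release GPU memory accumulated across examples.
import gc
import torch

def release_cuda_memory() -> None:
    torch._dynamo.reset()      # drop graphs cached by torch.compile
    gc.collect()               # collect dead Python references to tensors
    torch.cuda.empty_cache()   # hand cached allocator blocks back to the driver
    torch.cuda.synchronize()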
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:25.4670783Z
2025-05-07T20:32:25.4670901Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:25.4670906Z
2025-05-07T20:32:25.4671007Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.4671227Z self=,
2025-05-07T20:32:25.4671307Z T=128,
2025-05-07T20:32:25.4671424Z D=7168,
2025-05-07T20:32:25.4671511Z scale_ub=None,
2025-05-07T20:32:25.4671598Z contiguous=True,
2025-05-07T20:32:25.4671678Z compiled=True,
2025-05-07T20:32:25.4671752Z )
2025-05-07T20:32:25.4671963Z self =
2025-05-07T20:32:25.4672126Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:25.4672130Z
2025-05-07T20:32:25.4672254Z @given(
2025-05-07T20:32:25.4672374Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:25.4672470Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:25.4672589Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:25.4672703Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:25.4672817Z compiled=st.sampled_from([True, False]),
2025-05-07T20:32:25.4672891Z )
2025-05-07T20:32:25.4673134Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:25.4673233Z def test_silu_mul_quant(
2025-05-07T20:32:25.4673310Z self,
2025-05-07T20:32:25.4673385Z T: int,
2025-05-07T20:32:25.4673465Z D: int,
2025-05-07T20:32:25.4673561Z scale_ub: Optional[float],
2025-05-07T20:32:25.4673647Z contiguous: bool,
2025-05-07T20:32:25.4673734Z compiled: bool,
2025-05-07T20:32:25.4673813Z ) -> None:
2025-05-07T20:32:25.4673906Z torch.manual_seed(2025)
2025-05-07T20:32:25.4674029Z
2025-05-07T20:32:25.4674193Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:25.4675952Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:25.4675959Z
2025-05-07T20:32:25.4676076Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:25.4676218Z =============================== warnings summary ===============================
2025-05-07T20:32:25.4676522Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108
2025-05-07T20:32:25.4676825Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108
2025-05-07T20:32:25.4677124Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108
2025-05-07T20:32:25.4678034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
2025-05-07T20:32:25.4678269Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
2025-05-07T20:32:25.4678274Z
2025-05-07T20:32:25.4678481Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
2025-05-07T20:32:25.4678643Z ================= 1 failed, 1 deselected, 3 warnings in 13.78s =================
2025-05-07T20:32:27.0193059Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error)
2025-05-07T20:32:27.0808221Z [EXEC] [ATTEMPT 1/2] Command attempt failed.
2025-05-07T20:32:27.0808838Z
2025-05-07T20:32:29.0826081Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py
2025-05-07T20:32:31.2291666Z ============================= test session starts ==============================
2025-05-07T20:32:31.2292331Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:32:31.2292856Z cachedir: .pytest_cache
2025-05-07T20:32:31.2293423Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:32:31.2294255Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:32:31.2294706Z plugins: hypothesis-6.131.14
2025-05-07T20:32:32.8179360Z TMA benchmarks will be running with experimental grid constant TMA descriptor.
2025-05-07T20:32:32.9694614Z collecting ... collected 2 items / 1 deselected / 1 selected
2025-05-07T20:32:32.9695065Z run-last-failure: rerun previous 1 failure
2025-05-07T20:32:32.9695289Z
2025-05-07T20:32:35.3126066Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(
2025-05-07T20:32:35.3133299Z self=,
2025-05-07T20:32:35.3133763Z T=1,
2025-05-07T20:32:35.3133970Z D=5120,
2025-05-07T20:32:35.3134182Z scale_ub=None,
2025-05-07T20:32:35.3134404Z contiguous=True,
2025-05-07T20:32:35.3134640Z compiled=True,
2025-05-07T20:32:35.3134862Z )
2025-05-07T20:32:35.3135189Z self =
2025-05-07T20:32:35.3136034Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:35.3136299Z
2025-05-07T20:32:35.3136394Z @given(
2025-05-07T20:32:35.3136638Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:35.3136969Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:35.3137289Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:35.3137637Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:35.3137984Z compiled=st.sampled_from([True, False]),
2025-05-07T20:32:35.3138287Z )
2025-05-07T20:32:35.3138653Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:35.3139101Z def test_silu_mul_quant(
2025-05-07T20:32:35.3139359Z self,
2025-05-07T20:32:35.3139570Z T: int,
2025-05-07T20:32:35.3139774Z D: int,
2025-05-07T20:32:35.3140008Z scale_ub: Optional[float],
2025-05-07T20:32:35.3140296Z contiguous: bool,
2025-05-07T20:32:35.3140543Z compiled: bool,
2025-05-07T20:32:35.3140786Z ) -> None:
2025-05-07T20:32:35.3141016Z torch.manual_seed(2025)
2025-05-07T20:32:35.3141264Z
2025-05-07T20:32:35.3141551Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:35.3141908Z
2025-05-07T20:32:35.3142119Z x_sign = torch.sign(x)
2025-05-07T20:32:35.3142415Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:35.3142821Z x = x_sign * x_clamp 2025-05-07T20:32:35.3143086Z x0 = x[:, :D] 2025-05-07T20:32:35.3143318Z x1 = x[:, D:] 2025-05-07T20:32:35.3143532Z 2025-05-07T20:32:35.3143733Z if contiguous: 2025-05-07T20:32:35.3143983Z x0 = x0.contiguous() 2025-05-07T20:32:35.3144250Z x1 = x1.contiguous() 2025-05-07T20:32:35.3144506Z 2025-05-07T20:32:35.3144715Z if scale_ub is not None: 2025-05-07T20:32:35.3144996Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3145356Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3145681Z ) 2025-05-07T20:32:35.3145883Z else: 2025-05-07T20:32:35.3146111Z scale_ub_tensor = None 2025-05-07T20:32:35.3146377Z 2025-05-07T20:32:35.3146616Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3146949Z op = silu_mul_quant 2025-05-07T20:32:35.3147218Z if compiled: 2025-05-07T20:32:35.3147477Z op = torch.compile(op) 2025-05-07T20:32:35.3147877Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3148171Z 2025-05-07T20:32:35.3148379Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.3148671Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.3148975Z 2025-05-07T20:32:35.3149230Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3149571Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.3149965Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.3150293Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.3150659Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.3150986Z 2025-05-07T20:32:35.3151202Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.3151400Z 2025-05-07T20:32:35.3151508Z moe/activation_test.py:126: 2025-05-07T20:32:35.3151820Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3152173Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.3152511Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.3153311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.3154078Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.3154645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3155391Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3156083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.3156816Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.3157582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:35.3158333Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.3159071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.3159720Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.3160332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.3160861Z fn() 2025-05-07T20:32:35.3161379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.3161972Z self.fn.run( 
2025-05-07T20:32:35.3162452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3162989Z kernel = self.compile( 2025-05-07T20:32:35.3163594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3164258Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3164664Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3164911Z 2025-05-07T20:32:35.3165126Z self = 2025-05-07T20:32:35.3166267Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3167746Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5df65e13a0>} 2025-05-07T20:32:35.3169157Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3170192Z context = 2025-05-07T20:32:35.3170491Z 2025-05-07T20:32:35.3170661Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3171196Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3171723Z module_map=module_map) 2025-05-07T20:32:35.3172096Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3172469Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.3172750Z E ^ 2025-05-07T20:32:35.3173219Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3173679Z 2025-05-07T20:32:35.3174105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3174629Z 2025-05-07T20:32:35.3174739Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3175163Z self=, 2025-05-07T20:32:35.3175571Z T=2048, 2025-05-07T20:32:35.3175776Z D=5120, 2025-05-07T20:32:35.3175982Z scale_ub=1200.0, 2025-05-07T20:32:35.3176209Z contiguous=True, 2025-05-07T20:32:35.3176447Z compiled=False, 2025-05-07T20:32:35.3176713Z ) 2025-05-07T20:32:36.2357944Z self = 2025-05-07T20:32:36.2358658Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:36.2358940Z 2025-05-07T20:32:36.2359023Z @given( 2025-05-07T20:32:36.2359257Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.2359571Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.2359894Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.2360238Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.2360569Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.2360849Z ) 2025-05-07T20:32:36.2361203Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.2361646Z def test_silu_mul_quant( 2025-05-07T20:32:36.2361891Z self, 2025-05-07T20:32:36.2362079Z T: int, 2025-05-07T20:32:36.2362292Z D: int, 2025-05-07T20:32:36.2362514Z scale_ub: Optional[float], 2025-05-07T20:32:36.2362781Z contiguous: bool, 2025-05-07T20:32:36.2363023Z compiled: bool, 2025-05-07T20:32:36.2363260Z ) -> None: 2025-05-07T20:32:36.2363475Z torch.manual_seed(2025) 2025-05-07T20:32:36.2363716Z 2025-05-07T20:32:36.2363995Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.2364345Z 
2025-05-07T20:32:36.2364539Z x_sign = torch.sign(x) 2025-05-07T20:32:36.2365125Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.2365439Z x = x_sign * x_clamp 2025-05-07T20:32:36.2365672Z x0 = x[:, :D] 2025-05-07T20:32:36.2365896Z x1 = x[:, D:] 2025-05-07T20:32:36.2366106Z 2025-05-07T20:32:36.2366284Z if contiguous: 2025-05-07T20:32:36.2366515Z x0 = x0.contiguous() 2025-05-07T20:32:36.2366771Z x1 = x1.contiguous() 2025-05-07T20:32:36.2367005Z 2025-05-07T20:32:36.2367203Z if scale_ub is not None: 2025-05-07T20:32:36.2367472Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.2367906Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.2368215Z ) 2025-05-07T20:32:36.2368419Z else: 2025-05-07T20:32:36.2368633Z scale_ub_tensor = None 2025-05-07T20:32:36.2368881Z 2025-05-07T20:32:36.2369117Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.2369439Z op = silu_mul_quant 2025-05-07T20:32:36.2369785Z if compiled: 2025-05-07T20:32:36.2370042Z op = torch.compile(op) 2025-05-07T20:32:36.2370347Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.2370627Z 2025-05-07T20:32:36.2370824Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.2370989Z 2025-05-07T20:32:36.2371095Z moe/activation_test.py:117: 2025-05-07T20:32:36.2371390Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.2371807Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.2372095Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.2372796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.2373483Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.2374022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.2374712Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.2375381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.2375963Z kernel = self.compile( 2025-05-07T20:32:36.2376513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.2377174Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.2377656Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.2377892Z 2025-05-07T20:32:36.2378101Z self = 2025-05-07T20:32:36.2379186Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.2380583Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f5df62902c0>} 2025-05-07T20:32:36.2381934Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.2382966Z context = 2025-05-07T20:32:36.2383268Z 2025-05-07T20:32:36.2383439Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.2383966Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.2384438Z module_map=module_map) 2025-05-07T20:32:36.2384805Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.2385216Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.2385492Z E ^ 2025-05-07T20:32:36.2385956Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.2386412Z 2025-05-07T20:32:36.2386829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.2387347Z 2025-05-07T20:32:36.2387453Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.2387878Z self=, 2025-05-07T20:32:36.2388279Z T=2048, 2025-05-07T20:32:36.2388474Z D=5120, 2025-05-07T20:32:36.2388675Z scale_ub=1200.0, 2025-05-07T20:32:36.2388897Z contiguous=True, 2025-05-07T20:32:36.2389123Z compiled=True, 2025-05-07T20:32:36.2389338Z ) 2025-05-07T20:32:36.2389656Z self = 2025-05-07T20:32:36.2390202Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:36.2390475Z 2025-05-07T20:32:36.2390564Z @given( 2025-05-07T20:32:36.2390794Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.2391114Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.2391426Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.2391760Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.2392166Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.2392462Z ) 2025-05-07T20:32:36.2392821Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.2393260Z def test_silu_mul_quant( 2025-05-07T20:32:36.2393506Z self, 2025-05-07T20:32:36.2393709Z T: int, 2025-05-07T20:32:36.2393906Z D: int, 2025-05-07T20:32:36.2394131Z scale_ub: Optional[float], 2025-05-07T20:32:36.2394406Z contiguous: bool, 2025-05-07T20:32:36.2394645Z compiled: bool, 2025-05-07T20:32:36.2394876Z ) -> None: 2025-05-07T20:32:36.2395097Z torch.manual_seed(2025) 2025-05-07T20:32:36.2395345Z 2025-05-07T20:32:36.2395620Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.2395969Z 2025-05-07T20:32:36.2396170Z x_sign = torch.sign(x) 2025-05-07T20:32:36.2396463Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.2396777Z x = x_sign * x_clamp 2025-05-07T20:32:36.2397080Z x0 = x[:, :D] 2025-05-07T20:32:36.2397296Z x1 = x[:, D:] 2025-05-07T20:32:36.2397512Z 2025-05-07T20:32:36.2397706Z if contiguous: 2025-05-07T20:32:36.2397940Z x0 = x0.contiguous() 2025-05-07T20:32:36.2398203Z x1 = x1.contiguous() 2025-05-07T20:32:36.2398447Z 2025-05-07T20:32:36.2398643Z if scale_ub is not None: 2025-05-07T20:32:36.2398922Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.2399268Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.2399578Z ) 2025-05-07T20:32:36.2399782Z else: 2025-05-07T20:32:36.2399997Z scale_ub_tensor = None 2025-05-07T20:32:36.2400252Z 2025-05-07T20:32:36.2400490Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.2400808Z op = silu_mul_quant 2025-05-07T20:32:36.2401063Z if compiled: 
2025-05-07T20:32:36.2401310Z op = torch.compile(op) 2025-05-07T20:32:36.2401613Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.2401892Z 2025-05-07T20:32:36.2402085Z y_fp8, y_scale = fn() 2025-05-07T20:32:36.2402375Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:36.2402667Z 2025-05-07T20:32:36.2402905Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.2403244Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:36.2403540Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:36.2403904Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:36.2404270Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:36.2404589Z 2025-05-07T20:32:36.2404790Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:36.2404992Z 2025-05-07T20:32:36.2405093Z moe/activation_test.py:126: 2025-05-07T20:32:36.2405395Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.2406081Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:36.2406415Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:36.2407198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:36.2408019Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:36.2408559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.2409321Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.2410011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:36.2410730Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:36.2411477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:36.2413085Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:36.2413820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:36.2414462Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:36.2415059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:36.2415586Z fn() 2025-05-07T20:32:36.2416101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:36.2416687Z self.fn.run( 2025-05-07T20:32:36.2417153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.2417688Z kernel = self.compile( 2025-05-07T20:32:36.2418232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.2418959Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.2419362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.2419589Z 2025-05-07T20:32:36.2419803Z self = 2025-05-07T20:32:36.2420888Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:36.2422261Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5df6291440>} 2025-05-07T20:32:36.2423614Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.2424643Z context = 2025-05-07T20:32:36.2424932Z 2025-05-07T20:32:36.2425107Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.2425625Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.2426094Z module_map=module_map) 2025-05-07T20:32:36.2426535Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.2426900Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:36.2427168Z E ^ 2025-05-07T20:32:36.2427647Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.2428095Z 2025-05-07T20:32:36.2428515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.2429031Z 2025-05-07T20:32:36.2429142Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.2429550Z self=, 2025-05-07T20:32:36.2429959Z T=16384, 2025-05-07T20:32:36.2430161Z D=7168, 2025-05-07T20:32:36.2430355Z scale_ub=1200.0, 2025-05-07T20:32:36.2430584Z contiguous=False, 2025-05-07T20:32:36.2430814Z compiled=False, 2025-05-07T20:32:36.2431029Z ) 2025-05-07T20:32:37.0252104Z self = 2025-05-07T20:32:37.0252826Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:37.0253133Z 2025-05-07T20:32:37.0253214Z @given( 2025-05-07T20:32:37.0253457Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0253776Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0254092Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0254549Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0254889Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0255176Z ) 2025-05-07T20:32:37.0255534Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0255985Z def test_silu_mul_quant( 2025-05-07T20:32:37.0256230Z self, 2025-05-07T20:32:37.0256437Z T: int, 2025-05-07T20:32:37.0256646Z D: int, 2025-05-07T20:32:37.0256870Z scale_ub: Optional[float], 2025-05-07T20:32:37.0257157Z contiguous: bool, 2025-05-07T20:32:37.0257410Z compiled: bool, 2025-05-07T20:32:37.0257640Z ) -> None: 2025-05-07T20:32:37.0257866Z torch.manual_seed(2025) 2025-05-07T20:32:37.0258121Z 2025-05-07T20:32:37.0258397Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0258750Z 2025-05-07T20:32:37.0258957Z x_sign = torch.sign(x) 2025-05-07T20:32:37.0259354Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.0259667Z x = x_sign * x_clamp 2025-05-07T20:32:37.0259916Z x0 = x[:, :D] 2025-05-07T20:32:37.0260146Z x1 = x[:, D:] 2025-05-07T20:32:37.0260361Z 2025-05-07T20:32:37.0260558Z if contiguous: 2025-05-07T20:32:37.0260801Z x0 = x0.contiguous() 2025-05-07T20:32:37.0261066Z x1 = x1.contiguous() 2025-05-07T20:32:37.0261317Z 2025-05-07T20:32:37.0261522Z if scale_ub is not None: 2025-05-07T20:32:37.0261804Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.0262153Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.0262476Z ) 2025-05-07T20:32:37.0262676Z else: 2025-05-07T20:32:37.0262902Z scale_ub_tensor = None 2025-05-07T20:32:37.0263163Z 2025-05-07T20:32:37.0263398Z def fn() -> Tuple[torch.Tensor, 
torch.Tensor]:
2025-05-07T20:32:37.0263726Z             op = silu_mul_quant
2025-05-07T20:32:37.0263998Z             if compiled:
2025-05-07T20:32:37.0264252Z                 op = torch.compile(op)
2025-05-07T20:32:37.0264560Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:37.0264847Z 
2025-05-07T20:32:37.0265052Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:37.0265221Z 
2025-05-07T20:32:37.0265326Z moe/activation_test.py:117: 
2025-05-07T20:32:37.0265635Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:37.0266107Z moe/activation_test.py:115: in fn
2025-05-07T20:32:37.0266396Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:37.0267098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:37.0267801Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:37.0268345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:37.0269042Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:37.0269714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:37.0270256Z     kernel = self.compile(
2025-05-07T20:32:37.0270801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:37.0271461Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:37.0271917Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:37.0272153Z 
2025-05-07T20:32:37.0273445Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:37.0277558Z 
2025-05-07T20:32:37.0277733Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:37.0278255Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:37.0278730Z                            module_map=module_map)
2025-05-07T20:32:37.0279101Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:37.0279458Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:37.0279771Z E       ^
2025-05-07T20:32:37.0280240Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.0280691Z 
2025-05-07T20:32:37.0281111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:37.0281622Z 
2025-05-07T20:32:37.0281729Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.0282568Z     T=1,
2025-05-07T20:32:37.0282757Z     D=7168,
2025-05-07T20:32:37.0282967Z     scale_ub=None,
2025-05-07T20:32:37.0283193Z     contiguous=True,
2025-05-07T20:32:37.0283421Z     compiled=True,
2025-05-07T20:32:37.0283640Z )
2025-05-07T20:32:37.0284464Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:37.0291137Z 
2025-05-07T20:32:37.0291226Z     @given(
2025-05-07T20:32:37.0291471Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:37.0291795Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:37.0292105Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:37.0292442Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:37.0292786Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:37.0293070Z     )
2025-05-07T20:32:37.0293506Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:37.0293960Z     def test_silu_mul_quant(
2025-05-07T20:32:37.0294208Z         self,
2025-05-07T20:32:37.0294418Z         T: int,
2025-05-07T20:32:37.0294622Z         D: int,
2025-05-07T20:32:37.0294844Z         scale_ub: Optional[float],
2025-05-07T20:32:37.0295125Z         contiguous: bool,
2025-05-07T20:32:37.0295372Z         compiled: bool,
2025-05-07T20:32:37.0295611Z     ) -> None:
2025-05-07T20:32:37.0295832Z         torch.manual_seed(2025)
2025-05-07T20:32:37.0296114Z 
2025-05-07T20:32:37.0296409Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:37.0296756Z 
2025-05-07T20:32:37.0296960Z         x_sign = torch.sign(x)
2025-05-07T20:32:37.0297262Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:37.0297573Z         x = x_sign * x_clamp
2025-05-07T20:32:37.0297825Z         x0 = x[:, :D]
2025-05-07T20:32:37.0298055Z         x1 = x[:, D:]
2025-05-07T20:32:37.0298354Z 
2025-05-07T20:32:37.0298554Z         if contiguous:
2025-05-07T20:32:37.0298794Z             x0 = x0.contiguous()
2025-05-07T20:32:37.0299052Z             x1 = x1.contiguous()
2025-05-07T20:32:37.0299297Z 
2025-05-07T20:32:37.0299491Z         if scale_ub is not None:
2025-05-07T20:32:37.0299772Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:37.0300113Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:37.0300485Z             )
2025-05-07T20:32:37.0300690Z         else:
2025-05-07T20:32:37.0300903Z             scale_ub_tensor = None
2025-05-07T20:32:37.0301168Z 
2025-05-07T20:32:37.0301409Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:37.0301733Z             op = silu_mul_quant
2025-05-07T20:32:37.0301983Z             if compiled:
2025-05-07T20:32:37.0302239Z                 op = torch.compile(op)
2025-05-07T20:32:37.0302543Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:37.0302824Z 
2025-05-07T20:32:37.0303028Z         y_fp8, y_scale = fn()
2025-05-07T20:32:37.0303320Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:37.0303616Z 
2025-05-07T20:32:37.0303867Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:37.0304208Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:37.0304502Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:37.0304830Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:37.0305246Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:37.0305568Z 
2025-05-07T20:32:37.0306226Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:37.0306500Z 
2025-05-07T20:32:37.0306631Z moe/activation_test.py:126: 
2025-05-07T20:32:37.0307026Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:37.0307461Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:37.0307806Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:37.0308597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:37.0309351Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:37.0309904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:37.0310594Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:37.0311287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:37.0312004Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:37.0312757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:37.0313610Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:37.0314342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:37.0314976Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:37.0315584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:37.0316111Z     fn()
2025-05-07T20:32:37.0316620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:37.0317211Z     self.fn.run(
2025-05-07T20:32:37.0317684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:37.0318218Z     kernel = self.compile(
2025-05-07T20:32:37.0318755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:37.0319483Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:37.0319884Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:37.0320113Z 
2025-05-07T20:32:37.0321406Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:37.0325554Z 
2025-05-07T20:32:37.0325731Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:37.0326254Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:37.0326734Z                            module_map=module_map)
2025-05-07T20:32:37.0327195Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:37.0327651Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:37.0327926Z E       ^
2025-05-07T20:32:37.0328394Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.0328847Z 
2025-05-07T20:32:37.0329273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
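Note on the failure above: make_ir rejects both kernels because Triton's fp8e4nv element type (the CUDA-native torch.float8_e4m3fn layout) is only lowerable on GPUs with compute capability 8.9 or newer, while the A10G in a linux.g5.4xlarge runner reports 8.6; hence only 'fp8e4b15' and 'fp8e5' are offered. A minimal sketch of a capability gate that would skip these tests on such runners; the helper and class names are hypothetical, not FBGEMM API:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Assumption: fp8e4nv (torch.float8_e4m3fn) needs SM >= 8.9 (Ada/Hopper).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv not supported on this GPU")
class Fp8ActivationTests(unittest.TestCase):  # hypothetical gated suite
    def test_silu_mul_quant_smoke(self) -> None:
        ...  # fp8 test body would run only on SM 8.9+ devices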
2025-05-07T20:32:37.0329782Z 
2025-05-07T20:32:37.0329894Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.0330728Z     T=4096,
2025-05-07T20:32:37.0330922Z     D=5120,
2025-05-07T20:32:37.0331124Z     scale_ub=None,
2025-05-07T20:32:37.0331347Z     contiguous=False,
2025-05-07T20:32:37.0331574Z     compiled=False,
2025-05-07T20:32:37.0331789Z )
2025-05-07T20:32:37.9289096Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:37.9289371Z moe/activation_test.py:117: 
2025-05-07T20:32:37.9303026Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:37.9303380Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:37.9303651Z E       ^
2025-05-07T20:32:37.9304181Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.9305059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:37.9305574Z 
2025-05-07T20:32:37.9305953Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.9306935Z     T=4096,
2025-05-07T20:32:37.9307125Z     D=7168,
2025-05-07T20:32:37.9307334Z     scale_ub=None,
2025-05-07T20:32:37.9307556Z     contiguous=False,
2025-05-07T20:32:37.9307784Z     compiled=False,
2025-05-07T20:32:37.9308003Z )
2025-05-07T20:32:37.9320919Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:37.9321184Z moe/activation_test.py:117: 
2025-05-07T20:32:37.9334985Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:37.9335347Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:37.9335606Z E       ^
2025-05-07T20:32:37.9336099Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.9337007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
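For orientation, ref_fn in the listing above composes a SiLU-mul in fp32 with FBGEMM's triton_quantize_fp8_row. A rough pure-PyTorch equivalent of that rowwise quantization, assuming a recent PyTorch with torch.float8_e4m3fn; the eps and clamping details are illustrative and need not match the FBGEMM kernel exactly:

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def silu_mul_quant_ref(x0: torch.Tensor, x1: torch.Tensor, scale_ub=None):
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()  # SiLU(x0) * x1
    row_max = y.abs().amax(dim=1)                            # per-row amax
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub)         # cap the scale
    y_scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale


x = torch.randn(4, 16)
y_fp8, y_scale = silu_mul_quant_ref(x, x)  # runs on CPU, no Triton involved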
2025-05-07T20:32:37.9337518Z 
2025-05-07T20:32:37.9337635Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.9338453Z     T=128,
2025-05-07T20:32:37.9338647Z     D=7168,
2025-05-07T20:32:37.9338844Z     scale_ub=None,
2025-05-07T20:32:37.9339120Z     contiguous=False,
2025-05-07T20:32:37.9339360Z     compiled=True,
2025-05-07T20:32:37.9339566Z )
2025-05-07T20:32:37.9795753Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:37.9796067Z moe/activation_test.py:126: 
2025-05-07T20:32:37.9816888Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:37.9817252Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:37.9817516Z E       ^
2025-05-07T20:32:37.9817978Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.9818853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:37.9819372Z 
2025-05-07T20:32:37.9819479Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.9820297Z     T=128,
2025-05-07T20:32:37.9820556Z     D=7168,
2025-05-07T20:32:37.9820763Z     scale_ub=None,
2025-05-07T20:32:37.9820977Z     contiguous=False,
2025-05-07T20:32:37.9821211Z     compiled=False,
2025-05-07T20:32:37.9821419Z )
2025-05-07T20:32:38.2826529Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:38.2826813Z moe/activation_test.py:117: 
2025-05-07T20:32:38.2840559Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:38.2840913Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:38.2841184Z E       ^
2025-05-07T20:32:38.2841658Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:38.2842530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
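The "Trying example:" blocks are Hypothesis at verbosity=Verbosity.verbose walking the sampled_from grid declared on the test; every draw fails with the same compile error, so the whole grid is churned through. A self-contained sketch of that pattern, with the grid copied from the listing above and _MAX_SAMPLES replaced by a literal since its value does not appear in this log:

from hypothesis import Verbosity, given, settings, strategies as st


@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@settings(verbosity=Verbosity.verbose, max_examples=16, deadline=None)
def test_grid_shapes(T, D, scale_ub, contiguous, compiled) -> None:
    # Each verbose draw prints a "Trying example: ..." line like those above.
    assert T > 0 and D in (5120, 7168)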
2025-05-07T20:32:38.2843047Z 
2025-05-07T20:32:38.2843154Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:38.2843980Z     T=4096,
2025-05-07T20:32:38.2844172Z     D=5120,
2025-05-07T20:32:38.2844370Z     scale_ub=1200.0,
2025-05-07T20:32:38.2844597Z     contiguous=True,
2025-05-07T20:32:38.2844870Z     compiled=False,
2025-05-07T20:32:38.2845088Z )
2025-05-07T20:32:38.2857956Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:38.2858221Z moe/activation_test.py:117: 
2025-05-07T20:32:38.2871826Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:38.2872183Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:38.2872433Z E       ^
2025-05-07T20:32:38.2872893Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:38.2873755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:38.2874263Z 
2025-05-07T20:32:38.2874376Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:38.2875178Z     T=1,
2025-05-07T20:32:38.2875367Z     D=5120,
2025-05-07T20:32:38.2875553Z     scale_ub=None,
2025-05-07T20:32:38.2875767Z     contiguous=True,
2025-05-07T20:32:38.2875988Z     compiled=True,
2025-05-07T20:32:38.2876187Z )
2025-05-07T20:32:38.7160552Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:38.7160854Z moe/activation_test.py:126: 
2025-05-07T20:32:38.7181178Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:38.7181539Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:38.7181810Z E       ^
2025-05-07T20:32:38.7182270Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:38.7183137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
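Both _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row die at the same stage: src.make_ir rejects the fp8e4nv element type while lowering the jitted source, before any PTX is emitted. A minimal repro sketch under the same assumption; any Triton kernel that casts to tl.float8e4nv, launched on a pre-SM-8.9 GPU, should raise the identical ValueError, while on supported hardware it just stores the cast value:

import torch
import triton
import triton.language as tl


@triton.jit
def _cast_to_fp8(x_ptr, y_ptr):
    x = tl.load(x_ptr)
    tl.store(y_ptr, x.to(tl.float8e4nv))  # the cast make_ir rejects on SM < 8.9


x = torch.ones(1, device="cuda", dtype=torch.float32)
y = torch.empty(1, device="cuda", dtype=torch.float8_e4m3fn)
_cast_to_fp8[(1,)](x, y)  # CompilationError wrapping the ValueError on an A10G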
2025-05-07T20:32:38.7183656Z 
2025-05-07T20:32:38.7183761Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:38.7184571Z     T=2048,
2025-05-07T20:32:38.7184765Z     D=5120,
2025-05-07T20:32:38.7184964Z     scale_ub=None,
2025-05-07T20:32:38.7185181Z     contiguous=True,
2025-05-07T20:32:38.7185410Z     compiled=True,
2025-05-07T20:32:38.7185622Z )
2025-05-07T20:32:39.1346747Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:39.1347053Z moe/activation_test.py:126: 
2025-05-07T20:32:39.1367294Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:39.1367776Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:39.1368046Z E       ^
2025-05-07T20:32:39.1368506Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:39.1369383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:39.1369900Z 
2025-05-07T20:32:39.1370005Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:39.1370820Z     T=128,
2025-05-07T20:32:39.1371021Z     D=5120,
2025-05-07T20:32:39.1371229Z     scale_ub=None,
2025-05-07T20:32:39.1371449Z     contiguous=True,
2025-05-07T20:32:39.1371732Z     compiled=True,
2025-05-07T20:32:39.1371949Z )
2025-05-07T20:32:39.7909501Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:39.7909810Z moe/activation_test.py:126: 
2025-05-07T20:32:39.7930492Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:39.7930861Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:39.7931138Z E       ^
2025-05-07T20:32:39.7931621Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:39.7932508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
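Note where the ref_fn tracebacks pass through triton/runtime/autotuner.py: _kernel_quantize_fp8_row is an autotuned kernel, and Triton compiles lazily, so the unsupported-dtype error only surfaces when the autotuner benchmarks its pruned configs on the first launch. A sketch of that decorator pattern; the config values and key below are illustrative, not FBGEMM's:

import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 512}, num_warps=4, num_stages=2),
        triton.Config({"BLOCK": 1024}, num_warps=8, num_stages=2),
    ],
    key=["N"],  # retune (and recompile) per distinct N
)
@triton.jit
def _rowwise_kernel(x_ptr, N, BLOCK: tl.constexpr):
    # Body elided: each config is compiled and timed on first launch, which is
    # why the compile error appears under autotuner._bench / do_bench above.
    pass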
= None 2025-05-07T20:32:40.2840363Z 2025-05-07T20:32:40.2840598Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.2840911Z op = silu_mul_quant 2025-05-07T20:32:40.2841163Z if compiled: 2025-05-07T20:32:40.2841413Z op = torch.compile(op) 2025-05-07T20:32:40.2841702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.2841981Z 2025-05-07T20:32:40.2842173Z y_fp8, y_scale = fn() 2025-05-07T20:32:40.2842545Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:40.2842833Z 2025-05-07T20:32:40.2843068Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.2843403Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:40.2843691Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:40.2844008Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:40.2844367Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.2844680Z 2025-05-07T20:32:40.2844882Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:40.2845079Z 2025-05-07T20:32:40.2845183Z moe/activation_test.py:126: 2025-05-07T20:32:40.2845495Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.2845822Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:40.2846148Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.2847029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:40.2847893Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:40.2848440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.2849128Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.2849870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:40.2850593Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.2851349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:40.2852101Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.2852841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:40.2853479Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:40.2854085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:40.2854612Z fn() 2025-05-07T20:32:40.2855124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:40.2855765Z self.fn.run( 2025-05-07T20:32:40.2856240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.2856776Z kernel = self.compile( 2025-05-07T20:32:40.2857320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.2857982Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.2858388Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.2858619Z 2025-05-07T20:32:40.2858836Z self = 2025-05-07T20:32:40.2859914Z options = 
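For reference, the contract the test checks is visible in its last lines: dequantization is y_fp8.to(torch.float32) * y_scale[:, None], i.e. one fp32 scale per row. A rough eager sketch of row-wise fp8 quantization consistent with that contract follows; the helper name is hypothetical, and whether FBGEMM applies scale_ub to the row maximum or to the scale itself is an assumption here, not something the log confirms.

    from typing import Optional, Tuple
    import torch

    def rowwise_quantize_fp8_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor]
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, chosen so the row fits in fp8e4m3's range (max 448.0).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            # Assumed clamp point: cap the row maximum at scale_ub.
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=1e-12) / fp8_max  # avoid divide-by-zero
        y_fp8 = (y.to(torch.float32) / scale).to(torch.float8_e4m3fn)
        # The test dequantizes as y_fp8.to(float32) * scale[:, None].
        return y_fp8, scale.squeeze(1)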
moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f5d3fc79ee0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:40.2867569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
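Every failure in this section bottoms out in the same Triton check: fp8e4nv (the e4m3 format PyTorch calls float8_e4m3fn) is only compilable on NVIDIA GPUs with compute capability 8.9 or newer, while older SM 8.x parts such as an A10G (SM 8.6) expose only 'fp8e4b15' and 'fp8e5', exactly as the message lists. A minimal sketch of a capability gate that a suite like this could use to skip rather than fail on such GPUs is below; the helper and marker names are hypothetical, not FBGEMM's actual gating.

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton fp8e4nv kernels need SM 8.9+ (Ada/Hopper); an A10G reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical marker; the real suite may gate on a different helper.
    requires_fp8e4nv = pytest.mark.skipif(
        not supports_fp8e4nv(),
        reason="Triton fp8e4nv requires compute capability >= 8.9",
    )

With such a gate, the Hypothesis sweep would report skips on pre-SM-8.9 runners instead of failing every example.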
2025-05-07T20:32:40.2868188Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:32:40.3128037Z W0507 20:32:40.311000 88430 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:40.3129477Z W0507 20:32:40.311000 88430 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:40.3130809Z W0507 20:32:40.311000 88430 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:40.3131807Z W0507 20:32:40.311000 88430 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:40.3132908Z W0507 20:32:40.311000 88430 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.

This example fails identically to T=4096 above: ref_fn() at moe/activation_test.py:126 -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row, raising the same CompilationError (fp8e4nv not supported).
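The recompile_limit warning is explained by the test body itself: x0 = x[:, :D] is a view of a [T, 2*D] tensor, so its row stride is 2*D (10240 for D=5120), while the contiguous=True branch copies it down to row stride D (5120); sweeping contiguous and T across examples forces torch.compile to re-specialize until it hits the limit of 8 and falls back to eager. A sketch of the two knobs the warning points at, assuming current torch._dynamo config names (the knob name is printed by the warning itself):

    import torch

    # Tolerate more shape/stride variants before torch.compile falls back to eager.
    torch._dynamo.config.recompile_limit = 32

    # To print every recompilation reason, run the suite with, e.g.:
    #   TORCH_LOGS="recompiles" python -m pytest moe/activation_test.py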
_kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:40.3838470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.3839209Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.3839904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:40.3840634Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.3841388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:40.3842149Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.3842889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:40.3843540Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:40.3844143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:40.3844679Z fn() 2025-05-07T20:32:40.3845196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:40.3845779Z self.fn.run( 2025-05-07T20:32:40.3846259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.3846806Z kernel = self.compile( 2025-05-07T20:32:40.3847402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.3848178Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.3848585Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.3848818Z 2025-05-07T20:32:40.3849038Z self = 2025-05-07T20:32:40.3850129Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.3851524Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3f91d080>} 2025-05-07T20:32:40.3852926Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.3853961Z context = 2025-05-07T20:32:40.3854252Z 2025-05-07T20:32:40.3854426Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.3854946Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.3855421Z module_map=module_map) 2025-05-07T20:32:40.3855843Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.3856209Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:40.3856479Z E ^ 2025-05-07T20:32:40.3856950Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.3857410Z 2025-05-07T20:32:40.3857828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.3858346Z 2025-05-07T20:32:40.3858463Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.3858884Z self=, 2025-05-07T20:32:40.3859288Z T=1, 2025-05-07T20:32:40.3859486Z D=5120, 2025-05-07T20:32:40.3859692Z scale_ub=1200.0, 2025-05-07T20:32:40.3859921Z contiguous=True, 2025-05-07T20:32:40.3860155Z compiled=True, 2025-05-07T20:32:40.3860376Z ) 2025-05-07T20:32:40.6586240Z self = 2025-05-07T20:32:40.6586782Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:40.6587049Z 2025-05-07T20:32:40.6587132Z @given( 2025-05-07T20:32:40.6587375Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.6587693Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.6587996Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.6588351Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.6588691Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.6588979Z ) 2025-05-07T20:32:40.6589332Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.6589783Z def test_silu_mul_quant( 2025-05-07T20:32:40.6590033Z self, 2025-05-07T20:32:40.6590230Z T: int, 2025-05-07T20:32:40.6590438Z D: int, 2025-05-07T20:32:40.6590662Z scale_ub: Optional[float], 2025-05-07T20:32:40.6590948Z contiguous: bool, 2025-05-07T20:32:40.6591196Z compiled: bool, 2025-05-07T20:32:40.6591433Z ) -> None: 2025-05-07T20:32:40.6591649Z torch.manual_seed(2025) 2025-05-07T20:32:40.6591902Z 2025-05-07T20:32:40.6592183Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.6592531Z 2025-05-07T20:32:40.6592733Z x_sign = torch.sign(x) 2025-05-07T20:32:40.6593310Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.6593628Z x = x_sign * x_clamp 2025-05-07T20:32:40.6593879Z x0 = x[:, :D] 2025-05-07T20:32:40.6594103Z x1 = x[:, D:] 2025-05-07T20:32:40.6594309Z 2025-05-07T20:32:40.6594504Z if contiguous: 2025-05-07T20:32:40.6594743Z x0 = x0.contiguous() 2025-05-07T20:32:40.6594999Z x1 = x1.contiguous() 2025-05-07T20:32:40.6595249Z 2025-05-07T20:32:40.6595448Z if scale_ub is not None: 2025-05-07T20:32:40.6595722Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.6596066Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.6596380Z ) 2025-05-07T20:32:40.6596581Z else: 2025-05-07T20:32:40.6596790Z scale_ub_tensor = None 2025-05-07T20:32:40.6597046Z 2025-05-07T20:32:40.6597281Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6597596Z op = silu_mul_quant 2025-05-07T20:32:40.6597851Z if compiled: 2025-05-07T20:32:40.6598198Z op = torch.compile(op) 2025-05-07T20:32:40.6598497Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6598783Z 2025-05-07T20:32:40.6598986Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.6599151Z 2025-05-07T20:32:40.6599252Z moe/activation_test.py:117: 2025-05-07T20:32:40.6599551Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6599888Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.6600256Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6600815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.6601384Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.6602051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.6602744Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.6603291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.6603980Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.6604649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.6605183Z kernel = self.compile( 2025-05-07T20:32:40.6606021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.6606795Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.6607192Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6607428Z 2025-05-07T20:32:40.6607735Z self = 2025-05-07T20:32:40.6608833Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.6610241Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3f2ddee0>} 2025-05-07T20:32:40.6611600Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.6612641Z context = 2025-05-07T20:32:40.6612937Z 2025-05-07T20:32:40.6613103Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.6613634Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.6614180Z module_map=module_map) 2025-05-07T20:32:40.6614556Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.6614918Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.6615187Z E ^ 2025-05-07T20:32:40.6615653Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.6616109Z 2025-05-07T20:32:40.6616543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.6623576Z 2025-05-07T20:32:40.6623720Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.6624158Z self=, 2025-05-07T20:32:40.6624577Z T=1, 2025-05-07T20:32:40.6624779Z D=5120, 2025-05-07T20:32:40.6624980Z scale_ub=None, 2025-05-07T20:32:40.6625218Z contiguous=False, 2025-05-07T20:32:40.6625458Z compiled=True, 2025-05-07T20:32:40.6625675Z ) 2025-05-07T20:32:40.7098044Z self = 2025-05-07T20:32:40.7098567Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:40.7098833Z 2025-05-07T20:32:40.7098926Z @given( 2025-05-07T20:32:40.7099159Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.7099483Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.7099800Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.7100259Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.7100597Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.7100892Z ) 2025-05-07T20:32:40.7101242Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.7101693Z def test_silu_mul_quant( 2025-05-07T20:32:40.7101943Z self, 2025-05-07T20:32:40.7102149Z T: int, 2025-05-07T20:32:40.7102348Z D: int, 2025-05-07T20:32:40.7102578Z scale_ub: Optional[float], 2025-05-07T20:32:40.7102869Z contiguous: bool, 2025-05-07T20:32:40.7103109Z compiled: bool, 2025-05-07T20:32:40.7103345Z ) -> None: 2025-05-07T20:32:40.7103569Z torch.manual_seed(2025) 2025-05-07T20:32:40.7103808Z 2025-05-07T20:32:40.7104084Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.7104434Z 2025-05-07T20:32:40.7104632Z x_sign = torch.sign(x) 2025-05-07T20:32:40.7105012Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.7105330Z x = x_sign * x_clamp 2025-05-07T20:32:40.7105579Z x0 = x[:, :D] 2025-05-07T20:32:40.7106042Z x1 = x[:, D:] 2025-05-07T20:32:40.7106257Z 2025-05-07T20:32:40.7106454Z if contiguous: 2025-05-07T20:32:40.7106686Z x0 = x0.contiguous() 2025-05-07T20:32:40.7106953Z x1 = x1.contiguous() 2025-05-07T20:32:40.7107204Z 2025-05-07T20:32:40.7107397Z if scale_ub is not None: 2025-05-07T20:32:40.7107684Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.7108028Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.7108345Z ) 2025-05-07T20:32:40.7108542Z else: 2025-05-07T20:32:40.7108759Z scale_ub_tensor = None 2025-05-07T20:32:40.7109023Z 2025-05-07T20:32:40.7109253Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.7109576Z op = silu_mul_quant 2025-05-07T20:32:40.7109841Z if compiled: 2025-05-07T20:32:40.7110088Z op = torch.compile(op) 2025-05-07T20:32:40.7110391Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.7110671Z 2025-05-07T20:32:40.7110869Z y_fp8, y_scale = fn() 2025-05-07T20:32:40.7111178Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:40.7111478Z 2025-05-07T20:32:40.7111723Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.7112139Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:40.7112446Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:40.7112770Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:40.7113131Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.7113451Z 2025-05-07T20:32:40.7113659Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:40.7113854Z 2025-05-07T20:32:40.7113963Z moe/activation_test.py:126: 2025-05-07T20:32:40.7114269Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.7114609Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:40.7114942Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.7115732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:40.7116488Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:40.7117153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.7117847Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.7118534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:40.7119264Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.7120082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:40.7120823Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.7121554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:40.7122198Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:40.7122812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:40.7123334Z fn() 2025-05-07T20:32:40.7123850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:40.7124442Z self.fn.run( 2025-05-07T20:32:40.7124924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.7125526Z kernel = self.compile( 2025-05-07T20:32:40.7126076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.7126738Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.7127135Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.7127374Z 2025-05-07T20:32:40.7127669Z self = 2025-05-07T20:32:40.7128764Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.7130156Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f5d3f2f9f80>} 2025-05-07T20:32:40.7131516Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.7132541Z context = 2025-05-07T20:32:40.7132840Z 2025-05-07T20:32:40.7133012Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.7133594Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.7134076Z module_map=module_map) 2025-05-07T20:32:40.7134441Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.7134806Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:40.7135083Z E ^ 2025-05-07T20:32:40.7135547Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.7136013Z 2025-05-07T20:32:40.7136434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.7136955Z 2025-05-07T20:32:40.7137061Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.7137480Z self=, 2025-05-07T20:32:40.7137877Z T=1, 2025-05-07T20:32:40.7138075Z D=5120, 2025-05-07T20:32:40.7138280Z scale_ub=None, 2025-05-07T20:32:40.7138496Z contiguous=True, 2025-05-07T20:32:40.7138776Z compiled=False, 2025-05-07T20:32:40.7138990Z ) 2025-05-07T20:32:40.8299016Z self = 2025-05-07T20:32:40.8299779Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:40.8300133Z 2025-05-07T20:32:40.8300236Z @given( 2025-05-07T20:32:40.8300537Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.8300949Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.8301557Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.8301900Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.8302239Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.8302543Z ) 2025-05-07T20:32:40.8302896Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.8303348Z def test_silu_mul_quant( 2025-05-07T20:32:40.8303603Z self, 2025-05-07T20:32:40.8303811Z T: int, 2025-05-07T20:32:40.8304030Z D: int, 2025-05-07T20:32:40.8304271Z scale_ub: Optional[float], 2025-05-07T20:32:40.8304550Z contiguous: bool, 2025-05-07T20:32:40.8304805Z compiled: bool, 2025-05-07T20:32:40.8305054Z ) -> None: 2025-05-07T20:32:40.8305277Z torch.manual_seed(2025) 2025-05-07T20:32:40.8305534Z 2025-05-07T20:32:40.8306114Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.8306565Z 2025-05-07T20:32:40.8306760Z x_sign = torch.sign(x) 2025-05-07T20:32:40.8307058Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.8307375Z x = x_sign * x_clamp 2025-05-07T20:32:40.8307617Z x0 = x[:, :D] 2025-05-07T20:32:40.8307842Z x1 = x[:, D:] 2025-05-07T20:32:40.8308059Z 2025-05-07T20:32:40.8308252Z if contiguous: 2025-05-07T20:32:40.8308482Z x0 = x0.contiguous() 2025-05-07T20:32:40.8308742Z x1 = x1.contiguous() 2025-05-07T20:32:40.8308990Z 2025-05-07T20:32:40.8309184Z if scale_ub is not None: 2025-05-07T20:32:40.8309462Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.8309802Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.8310109Z ) 2025-05-07T20:32:40.8310304Z else: 2025-05-07T20:32:40.8310522Z scale_ub_tensor = None 2025-05-07T20:32:40.8310775Z 2025-05-07T20:32:40.8311015Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.8311335Z op = silu_mul_quant 2025-05-07T20:32:40.8311586Z if compiled: 2025-05-07T20:32:40.8311842Z 
op = torch.compile(op) 2025-05-07T20:32:40.8312148Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8312425Z 2025-05-07T20:32:40.8312623Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.8312794Z 2025-05-07T20:32:40.8312894Z moe/activation_test.py:117: 2025-05-07T20:32:40.8313278Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8313612Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.8313896Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8314592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.8315278Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.8315820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.8316514Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.8317228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.8317758Z kernel = self.compile( 2025-05-07T20:32:40.8318305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.8319041Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.8319444Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8319685Z 2025-05-07T20:32:40.8319894Z self = 2025-05-07T20:32:40.8320979Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.8322494Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3f2fb9c0>} 2025-05-07T20:32:40.8323844Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.8324866Z context = 2025-05-07T20:32:40.8325159Z 2025-05-07T20:32:40.8325325Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.8325852Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.8326322Z module_map=module_map) 2025-05-07T20:32:40.8326731Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.8327091Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.8327359Z E ^ 2025-05-07T20:32:40.8327915Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.8328368Z 2025-05-07T20:32:40.8328787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.8329308Z 2025-05-07T20:32:40.8329420Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.8329841Z self=, 2025-05-07T20:32:40.8330237Z T=128, 2025-05-07T20:32:40.8330434Z D=5120, 2025-05-07T20:32:40.8330632Z scale_ub=None, 2025-05-07T20:32:40.8330848Z contiguous=False, 2025-05-07T20:32:40.8331079Z compiled=True, 2025-05-07T20:32:40.8331289Z ) 2025-05-07T20:32:40.8331608Z self = 2025-05-07T20:32:40.8332107Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:40.8332383Z 2025-05-07T20:32:40.8332463Z @given( 2025-05-07T20:32:40.8332698Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.8333011Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.8333323Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.8333658Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.8334039Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.8334334Z ) 2025-05-07T20:32:40.8334685Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.8335127Z def test_silu_mul_quant( 2025-05-07T20:32:40.8335374Z self, 2025-05-07T20:32:40.8335573Z T: int, 2025-05-07T20:32:40.8335769Z D: int, 2025-05-07T20:32:40.8335991Z scale_ub: Optional[float], 2025-05-07T20:32:40.8336272Z contiguous: bool, 2025-05-07T20:32:40.8336508Z compiled: bool, 2025-05-07T20:32:40.8336737Z ) -> None: 2025-05-07T20:32:40.8336957Z torch.manual_seed(2025) 2025-05-07T20:32:40.8337207Z 2025-05-07T20:32:40.8337482Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.8337829Z 2025-05-07T20:32:40.8338033Z x_sign = torch.sign(x) 2025-05-07T20:32:40.8338328Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.8338648Z x = x_sign * x_clamp 2025-05-07T20:32:40.8338982Z x0 = x[:, :D] 2025-05-07T20:32:40.8339204Z x1 = x[:, D:] 2025-05-07T20:32:40.8339419Z 2025-05-07T20:32:40.8339613Z if contiguous: 2025-05-07T20:32:40.8339844Z x0 = x0.contiguous() 2025-05-07T20:32:40.8340113Z x1 = x1.contiguous() 2025-05-07T20:32:40.8340360Z 2025-05-07T20:32:40.8340553Z if scale_ub is not None: 2025-05-07T20:32:40.8340832Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.8341218Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.8341528Z ) 2025-05-07T20:32:40.8341735Z else: 2025-05-07T20:32:40.8341953Z scale_ub_tensor = None 2025-05-07T20:32:40.8342214Z 2025-05-07T20:32:40.8342448Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.8342770Z op = silu_mul_quant 2025-05-07T20:32:40.8343033Z if compiled: 2025-05-07T20:32:40.8343285Z op = torch.compile(op) 2025-05-07T20:32:40.8343591Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8343873Z 2025-05-07T20:32:40.8344065Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.8344236Z 2025-05-07T20:32:40.8344337Z moe/activation_test.py:117: 2025-05-07T20:32:40.8344643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8344978Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.8345317Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8345883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.8346446Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.8347157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.8347857Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.8348403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.8349083Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.8349749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.8350288Z kernel = self.compile( 2025-05-07T20:32:40.8350836Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.8351490Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.8351888Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8352117Z 2025-05-07T20:32:40.8352328Z self = 2025-05-07T20:32:40.8353456Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.8354826Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3f0a1120>} 2025-05-07T20:32:40.8356173Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.8357207Z context = 2025-05-07T20:32:40.8357499Z 2025-05-07T20:32:40.8357672Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.8358190Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.8358661Z module_map=module_map) 2025-05-07T20:32:40.8359079Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.8359444Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.8359704Z E ^ 2025-05-07T20:32:40.8360172Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.8360623Z 2025-05-07T20:32:40.8361046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.8361600Z 2025-05-07T20:32:40.8361713Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.8362124Z self=, 2025-05-07T20:32:40.8362535Z T=128, 2025-05-07T20:32:40.8362726Z D=7168, 2025-05-07T20:32:40.8362917Z scale_ub=1200.0, 2025-05-07T20:32:40.8363149Z contiguous=False, 2025-05-07T20:32:40.8363382Z compiled=False, 2025-05-07T20:32:40.8363590Z ) 2025-05-07T20:32:40.9234186Z self = 2025-05-07T20:32:40.9234953Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:40.9235340Z 2025-05-07T20:32:40.9235459Z @given( 2025-05-07T20:32:40.9235726Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.9236048Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.9236364Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.9236889Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.9237262Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.9237557Z ) 2025-05-07T20:32:40.9237918Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.9238363Z def test_silu_mul_quant( 2025-05-07T20:32:40.9238615Z self, 2025-05-07T20:32:40.9238823Z T: int, 2025-05-07T20:32:40.9239024Z D: int, 2025-05-07T20:32:40.9239252Z scale_ub: Optional[float], 2025-05-07T20:32:40.9239536Z contiguous: bool, 2025-05-07T20:32:40.9239784Z compiled: bool, 2025-05-07T20:32:40.9240014Z ) -> None: 2025-05-07T20:32:40.9240238Z torch.manual_seed(2025) 2025-05-07T20:32:40.9240485Z 2025-05-07T20:32:40.9240761Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.9241116Z 2025-05-07T20:32:40.9241318Z x_sign = torch.sign(x) 2025-05-07T20:32:40.9241612Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.9241940Z x = x_sign * x_clamp 2025-05-07T20:32:40.9242192Z x0 = x[:, :D] 2025-05-07T20:32:40.9242413Z x1 = x[:, D:] 2025-05-07T20:32:40.9242636Z 2025-05-07T20:32:40.9242832Z if contiguous: 2025-05-07T20:32:40.9243069Z x0 = x0.contiguous() 2025-05-07T20:32:40.9243340Z x1 = x1.contiguous() 2025-05-07T20:32:40.9243590Z 2025-05-07T20:32:40.9243788Z if scale_ub is not None: 2025-05-07T20:32:40.9244157Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.9244506Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.9244825Z ) 2025-05-07T20:32:40.9245023Z else: 2025-05-07T20:32:40.9245244Z scale_ub_tensor = None 2025-05-07T20:32:40.9245508Z 2025-05-07T20:32:40.9245745Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.9246072Z op = silu_mul_quant 2025-05-07T20:32:40.9246337Z if compiled: 2025-05-07T20:32:40.9246593Z op = torch.compile(op) 2025-05-07T20:32:40.9246899Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.9247213Z 2025-05-07T20:32:40.9247437Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.9247710Z 2025-05-07T20:32:40.9247814Z moe/activation_test.py:117: 2025-05-07T20:32:40.9248123Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.9248463Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.9248838Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.9249543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.9250244Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.9250783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.9251474Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.9252227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.9252767Z kernel = self.compile( 2025-05-07T20:32:40.9253309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.9253972Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.9254383Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.9254616Z 2025-05-07T20:32:40.9254825Z self = 2025-05-07T20:32:40.9255914Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.9257358Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3f0a0360>} 2025-05-07T20:32:40.9258709Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.9259740Z context = 2025-05-07T20:32:40.9260033Z 2025-05-07T20:32:40.9260205Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.9260737Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.9261210Z module_map=module_map) 2025-05-07T20:32:40.9261579Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.9261940Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.9262211Z E ^ 2025-05-07T20:32:40.9262682Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.9263134Z 2025-05-07T20:32:40.9263552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.9264072Z 2025-05-07T20:32:40.9264184Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.9264652Z self=, 2025-05-07T20:32:40.9265066Z T=128, 2025-05-07T20:32:40.9265259Z D=5120, 2025-05-07T20:32:40.9265470Z scale_ub=None, 2025-05-07T20:32:40.9265700Z contiguous=False, 2025-05-07T20:32:40.9265931Z compiled=False, 2025-05-07T20:32:40.9266155Z ) 2025-05-07T20:32:40.9266484Z self = 2025-05-07T20:32:40.9266993Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:40.9267318Z 2025-05-07T20:32:40.9267402Z @given( 2025-05-07T20:32:40.9267643Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.9267958Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.9268278Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.9268617Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.9268957Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.9269249Z ) 2025-05-07T20:32:40.9269657Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.9270108Z def test_silu_mul_quant( 2025-05-07T20:32:40.9270356Z self, 2025-05-07T20:32:40.9270561Z T: int, 2025-05-07T20:32:40.9270768Z D: int, 2025-05-07T20:32:40.9270989Z scale_ub: Optional[float], 2025-05-07T20:32:40.9271288Z contiguous: bool, 2025-05-07T20:32:40.9278306Z compiled: bool, 2025-05-07T20:32:40.9278631Z ) -> None: 2025-05-07T20:32:40.9278872Z torch.manual_seed(2025) 2025-05-07T20:32:40.9279133Z 2025-05-07T20:32:40.9279425Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.9279779Z 2025-05-07T20:32:40.9279989Z x_sign = torch.sign(x) 2025-05-07T20:32:40.9280300Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.9280613Z x = x_sign * x_clamp 2025-05-07T20:32:40.9280864Z x0 = x[:, :D] 2025-05-07T20:32:40.9281091Z x1 = x[:, D:] 2025-05-07T20:32:40.9281308Z 2025-05-07T20:32:40.9281511Z if contiguous: 2025-05-07T20:32:40.9281762Z x0 = x0.contiguous() 2025-05-07T20:32:40.9282027Z x1 = x1.contiguous() 2025-05-07T20:32:40.9282277Z 2025-05-07T20:32:40.9282488Z if scale_ub is not None: 2025-05-07T20:32:40.9282767Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.9283115Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.9283488Z ) 2025-05-07T20:32:40.9283690Z else: 2025-05-07T20:32:40.9283919Z scale_ub_tensor = None 2025-05-07T20:32:40.9284183Z 2025-05-07T20:32:40.9284430Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.9284755Z op = silu_mul_quant 2025-05-07T20:32:40.9285019Z if compiled: 2025-05-07T20:32:40.9285286Z op = torch.compile(op) 2025-05-07T20:32:40.9285587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.9285878Z 2025-05-07T20:32:40.9286085Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.9286250Z 2025-05-07T20:32:40.9286365Z moe/activation_test.py:117: 2025-05-07T20:32:40.9286666Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.9287017Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.9287314Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.9288086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.9288804Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.9289356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.9290056Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.9290727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.9291354Z kernel = self.compile( 2025-05-07T20:32:40.9291914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.9292585Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.9292990Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.9293231Z 2025-05-07T20:32:40.9293442Z self = 2025-05-07T20:32:40.9294544Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.9295941Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3ec28720>} 2025-05-07T20:32:40.9297348Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.9298398Z context = 2025-05-07T20:32:40.9298707Z 2025-05-07T20:32:40.9298878Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.9299458Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.9299937Z module_map=module_map) 2025-05-07T20:32:40.9300321Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.9300691Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.9300958Z E ^ 2025-05-07T20:32:40.9301437Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.9301907Z 2025-05-07T20:32:40.9302335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.9302852Z 2025-05-07T20:32:40.9302972Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.9303392Z self=, 2025-05-07T20:32:40.9303808Z T=128, 2025-05-07T20:32:40.9304017Z D=5120, 2025-05-07T20:32:40.9304268Z scale_ub=1200.0, 2025-05-07T20:32:40.9304509Z contiguous=True, 2025-05-07T20:32:40.9304747Z compiled=False, 2025-05-07T20:32:40.9304963Z ) 2025-05-07T20:32:41.2241820Z self = 2025-05-07T20:32:41.2242619Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:41.2242996Z 2025-05-07T20:32:41.2243080Z @given( 2025-05-07T20:32:41.2243319Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.2243667Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.2243977Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.2244313Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.2244645Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.2244937Z ) 2025-05-07T20:32:41.2245293Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.2245741Z def test_silu_mul_quant( 2025-05-07T20:32:41.2245998Z self, 2025-05-07T20:32:41.2246202Z T: int, 2025-05-07T20:32:41.2246407Z D: int, 2025-05-07T20:32:41.2246634Z scale_ub: Optional[float], 2025-05-07T20:32:41.2246908Z contiguous: bool, 2025-05-07T20:32:41.2247189Z compiled: bool, 2025-05-07T20:32:41.2247445Z ) -> None: 2025-05-07T20:32:41.2247783Z torch.manual_seed(2025) 2025-05-07T20:32:41.2248041Z 2025-05-07T20:32:41.2248615Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.2248970Z 2025-05-07T20:32:41.2249175Z x_sign = torch.sign(x) 2025-05-07T20:32:41.2249474Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.2249785Z x = x_sign * x_clamp 2025-05-07T20:32:41.2250037Z x0 = x[:, :D] 2025-05-07T20:32:41.2250264Z x1 = x[:, D:] 2025-05-07T20:32:41.2250478Z 2025-05-07T20:32:41.2250675Z if contiguous: 2025-05-07T20:32:41.2250925Z x0 = x0.contiguous() 2025-05-07T20:32:41.2251192Z x1 = x1.contiguous() 2025-05-07T20:32:41.2251443Z 2025-05-07T20:32:41.2251648Z if scale_ub is not None: 2025-05-07T20:32:41.2251921Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.2252268Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.2252587Z ) 2025-05-07T20:32:41.2252789Z else: 2025-05-07T20:32:41.2253002Z scale_ub_tensor = None 2025-05-07T20:32:41.2253266Z 2025-05-07T20:32:41.2253596Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.2253919Z op = silu_mul_quant 2025-05-07T20:32:41.2254178Z if compiled: 2025-05-07T20:32:41.2254434Z op = torch.compile(op) 2025-05-07T20:32:41.2254732Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2255016Z 2025-05-07T20:32:41.2255218Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.2255384Z 2025-05-07T20:32:41.2255557Z moe/activation_test.py:117: 2025-05-07T20:32:41.2255863Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2256200Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.2256486Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2257176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.2257879Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.2258427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.2259111Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.2259776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.2260315Z kernel = self.compile( 2025-05-07T20:32:41.2260861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.2261603Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.2262006Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2262235Z 2025-05-07T20:32:41.2262451Z self = 2025-05-07T20:32:41.2263542Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.2264932Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3ec298a0>} 2025-05-07T20:32:41.2266277Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.2267299Z context = 2025-05-07T20:32:41.2267594Z 2025-05-07T20:32:41.2267758Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.2268279Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.2268739Z module_map=module_map) 2025-05-07T20:32:41.2269152Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.2269508Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.2269772Z E ^ 2025-05-07T20:32:41.2270228Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.2270682Z 2025-05-07T20:32:41.2271096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.2271610Z 2025-05-07T20:32:41.2271718Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.2272124Z self=, 2025-05-07T20:32:41.2272520Z T=1, 2025-05-07T20:32:41.2272702Z D=7168, 2025-05-07T20:32:41.2272899Z scale_ub=1200.0, 2025-05-07T20:32:41.2273113Z contiguous=True, 2025-05-07T20:32:41.2273334Z compiled=True, 2025-05-07T20:32:41.2273549Z ) 2025-05-07T20:32:41.2273914Z self = 2025-05-07T20:32:41.2274407Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:41.2274672Z 2025-05-07T20:32:41.2274759Z @given( 2025-05-07T20:32:41.2274987Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.2275303Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.2275614Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.2275984Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.2276317Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.2276608Z ) 2025-05-07T20:32:41.2276960Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.2277401Z def test_silu_mul_quant( 2025-05-07T20:32:41.2277648Z self, 2025-05-07T20:32:41.2277847Z T: int, 2025-05-07T20:32:41.2278040Z D: int, 2025-05-07T20:32:41.2278262Z scale_ub: Optional[float], 2025-05-07T20:32:41.2278541Z contiguous: bool, 2025-05-07T20:32:41.2278779Z compiled: bool, 2025-05-07T20:32:41.2279004Z ) -> None: 2025-05-07T20:32:41.2279228Z torch.manual_seed(2025) 2025-05-07T20:32:41.2279469Z 2025-05-07T20:32:41.2279747Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.2280098Z 2025-05-07T20:32:41.2280295Z x_sign = torch.sign(x) 2025-05-07T20:32:41.2280596Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.2280963Z x = x_sign * x_clamp 2025-05-07T20:32:41.2281210Z x0 = x[:, :D] 2025-05-07T20:32:41.2281431Z x1 = x[:, D:] 2025-05-07T20:32:41.2281649Z 2025-05-07T20:32:41.2281843Z if contiguous: 2025-05-07T20:32:41.2282081Z x0 = x0.contiguous() 2025-05-07T20:32:41.2282345Z x1 = x1.contiguous() 2025-05-07T20:32:41.2282589Z 2025-05-07T20:32:41.2282781Z if scale_ub is not None: 2025-05-07T20:32:41.2283068Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.2283411Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.2283719Z ) 2025-05-07T20:32:41.2283920Z else: 2025-05-07T20:32:41.2284140Z scale_ub_tensor = None 2025-05-07T20:32:41.2284392Z 2025-05-07T20:32:41.2284627Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.2284946Z op = silu_mul_quant 2025-05-07T20:32:41.2285204Z if compiled: 2025-05-07T20:32:41.2285457Z op = torch.compile(op) 2025-05-07T20:32:41.2285756Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2286030Z 2025-05-07T20:32:41.2286228Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.2286399Z 2025-05-07T20:32:41.2286500Z moe/activation_test.py:117: 2025-05-07T20:32:41.2286800Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2287134Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.2287471Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2288132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.2288690Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.2289355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.2290053Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.2290600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.2291283Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.2291954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.2292494Z kernel = self.compile( 2025-05-07T20:32:41.2293084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.2293746Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.2294147Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2294377Z 2025-05-07T20:32:41.2294591Z self = 2025-05-07T20:32:41.2295669Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.2297104Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3ec2ae80>} 2025-05-07T20:32:41.2298456Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.2299480Z context = 2025-05-07T20:32:41.2299767Z 2025-05-07T20:32:41.2299939Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.2300458Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.2301004Z module_map=module_map) 2025-05-07T20:32:41.2301374Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.2301727Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.2301992Z E ^ 2025-05-07T20:32:41.2302462Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:41.2303953Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[test body and traceback elided: identical to the example above, failing at `y_fp8, y_scale = fn()` while compiling _fbgemm_silu_mul_quant with the same ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:41.3348146Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:41.3348559Z     self=<...>,
2025-05-07T20:32:41.3348962Z     T=1,
2025-05-07T20:32:41.3349156Z     D=7168,
2025-05-07T20:32:41.3349350Z     scale_ub=None,
2025-05-07T20:32:41.3349573Z     contiguous=False,
2025-05-07T20:32:41.3349804Z     compiled=True,
2025-05-07T20:32:41.3350014Z )
2025-05-07T20:32:41.4027178Z self = <...>
2025-05-07T20:32:41.4027917Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:41.4028284Z 
2025-05-07T20:32:41.4028397Z     @given(
2025-05-07T20:32:41.4028645Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:41.4028954Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:41.4029265Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:41.4029872Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:41.4030196Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:41.4030490Z     )
2025-05-07T20:32:41.4030845Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:41.4031284Z     def test_silu_mul_quant(
2025-05-07T20:32:41.4031535Z         self,
2025-05-07T20:32:41.4031738Z         T: int,
2025-05-07T20:32:41.4031935Z         D: int,
2025-05-07T20:32:41.4032163Z         scale_ub: Optional[float],
2025-05-07T20:32:41.4032439Z         contiguous: bool,
2025-05-07T20:32:41.4032681Z         compiled: bool,
2025-05-07T20:32:41.4032905Z     ) -> None:
2025-05-07T20:32:41.4033128Z         torch.manual_seed(2025)
2025-05-07T20:32:41.4033376Z 
2025-05-07T20:32:41.4033644Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.4033990Z 
2025-05-07T20:32:41.4034192Z         x_sign = torch.sign(x)
2025-05-07T20:32:41.4034481Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:41.4034793Z         x = x_sign * x_clamp
2025-05-07T20:32:41.4035038Z         x0 = x[:, :D]
2025-05-07T20:32:41.4035253Z         x1 = x[:, D:]
2025-05-07T20:32:41.4035467Z 
2025-05-07T20:32:41.4035654Z         if contiguous:
2025-05-07T20:32:41.4035880Z             x0 = x0.contiguous()
2025-05-07T20:32:41.4036141Z             x1 = x1.contiguous()
2025-05-07T20:32:41.4036384Z 
2025-05-07T20:32:41.4036665Z         if scale_ub is not None:
2025-05-07T20:32:41.4036947Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:41.4037285Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:41.4037596Z             )
2025-05-07T20:32:41.4037789Z         else:
2025-05-07T20:32:41.4038002Z             scale_ub_tensor = None
2025-05-07T20:32:41.4038259Z 
2025-05-07T20:32:41.4038488Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:41.4038811Z             op = silu_mul_quant
2025-05-07T20:32:41.4039062Z             if compiled:
2025-05-07T20:32:41.4039311Z                 op = torch.compile(op)
2025-05-07T20:32:41.4039610Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:41.4039891Z 
2025-05-07T20:32:41.4040088Z         y_fp8, y_scale = fn()
2025-05-07T20:32:41.4040380Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:41.4040677Z 
2025-05-07T20:32:41.4040914Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:41.4041335Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:41.4041637Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:41.4041956Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:41.4042314Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:41.4042631Z 
2025-05-07T20:32:41.4042842Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:41.4043039Z 
2025-05-07T20:32:41.4043216Z moe/activation_test.py:126: 
2025-05-07T20:32:41.4043518Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:41.4043859Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:41.4044184Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:41.4044974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:41.4045730Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:41.4046283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:41.4046961Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:41.4047765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:41.4048487Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:41.4056528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:41.4057398Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:41.4058153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:41.4058820Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:41.4059441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:41.4059968Z     fn()
2025-05-07T20:32:41.4060489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:41.4061086Z     self.fn.run(
2025-05-07T20:32:41.4061559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:41.4062111Z     kernel = self.compile(
2025-05-07T20:32:41.4062664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:41.4063330Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:41.4063733Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:41.4063977Z 
2025-05-07T20:32:41.4064269Z self = <...>
2025-05-07T20:32:41.4065363Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:41.4066757Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f5d3ee9d580>}
2025-05-07T20:32:41.4068160Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:41.4069193Z context = <...>
2025-05-07T20:32:41.4069491Z 
2025-05-07T20:32:41.4069660Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:41.4070240Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:41.4070713Z                            module_map=module_map)
2025-05-07T20:32:41.4071089Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:41.4071456Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:41.4071734Z E       ^
2025-05-07T20:32:41.4072205Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:41.4072744Z 
2025-05-07T20:32:41.4073162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:41.4073675Z 
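This example is different from the rest: fn() itself succeeded, and the failure moved into the eager reference path, because triton_quantize_fp8_row also JIT-compiles a Triton kernel (_kernel_quantize_fp8_row) that requests the same unsupported fp8e4nv type. A reference that avoids Triton entirely can be written in plain PyTorch; the sketch below assumes the returned scale is the per-row dequantization factor (so y is approximately y_fp8.float() * scale[:, None]) and that scale_ub caps the per-row max, which may not match fbgemm_gpu's exact semantics:

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def rowwise_quantize_fp8_reference(y: torch.Tensor, scale_ub=None):
    # Per-row absolute max in fp32, optionally capped by scale_ub.
    row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
    row_max = row_max.clamp(min=1e-12)  # guard all-zero rows
    scale = row_max / FP8_MAX  # one dequantization scale per row
    y_fp8 = (y.to(torch.float32) / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(-1)

The cast to torch.float8_e4m3fn is a plain elementwise conversion, so it runs on any CUDA device (and on CPU) regardless of whether the GPU has native fp8 instructions; only fused fp8 kernels like the Triton ones here need SM 8.9+.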
2025-05-07T20:32:41.4073792Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:41.5305235Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:41.5336814Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:41.7644987Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:41.8567215Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:41.8608774Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:41.8640426Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:42.0038404Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
[test body and traceback elided for each of the eight examples above: every one fails at `y_fp8, y_scale = fn()` with the same CompilationError while compiling _fbgemm_silu_mul_quant, raising ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
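Every "Trying example" block in this log is Hypothesis re-running the whole test body for one drawn parameter tuple (verbosity=Verbosity.verbose prints each draw, up to max_examples), which is why a single unsupported-dtype condition produces this wall of identical tracebacks. To replay one specific tuple deterministically, a case can be pinned with the @example decorator; a reduced, self-contained sketch (the strategy values are taken from the test above, the body here is a stand-in):

from hypothesis import Verbosity, example, given, settings, strategies as st


@settings(verbosity=Verbosity.verbose, max_examples=5, deadline=None)
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
)
@example(T=1, D=7168)  # pin the first failing tuple from this log
def test_shapes(T: int, D: int) -> None:
    assert T > 0 and D in (5120, 7168)


if __name__ == "__main__":
    test_shapes()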
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.0069559Z 2025-05-07T20:32:42.0069972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.0070486Z 2025-05-07T20:32:42.1211757Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1212758Z self=, 2025-05-07T20:32:42.1213301Z T=4096, 2025-05-07T20:32:42.1213496Z D=5120, 2025-05-07T20:32:42.1213688Z scale_ub=1200.0, 2025-05-07T20:32:42.1213920Z contiguous=False, 2025-05-07T20:32:42.1214152Z compiled=False, 2025-05-07T20:32:42.1214357Z ) 2025-05-07T20:32:42.1214682Z self = 2025-05-07T20:32:42.1215185Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.1215460Z 2025-05-07T20:32:42.1215549Z @given( 2025-05-07T20:32:42.1215783Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1216099Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1216405Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1216739Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1217072Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1217513Z ) 2025-05-07T20:32:42.1217859Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1218303Z def test_silu_mul_quant( 2025-05-07T20:32:42.1218551Z self, 2025-05-07T20:32:42.1218743Z T: int, 2025-05-07T20:32:42.1218950Z D: int, 2025-05-07T20:32:42.1219173Z scale_ub: Optional[float], 2025-05-07T20:32:42.1219440Z contiguous: bool, 2025-05-07T20:32:42.1219690Z compiled: bool, 2025-05-07T20:32:42.1219924Z ) -> None: 2025-05-07T20:32:42.1220142Z torch.manual_seed(2025) 2025-05-07T20:32:42.1220390Z 2025-05-07T20:32:42.1220668Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1221012Z 2025-05-07T20:32:42.1221212Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1221512Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1221822Z x = x_sign * x_clamp 2025-05-07T20:32:42.1222066Z x0 = x[:, :D] 2025-05-07T20:32:42.1222289Z x1 = x[:, D:] 2025-05-07T20:32:42.1222505Z 2025-05-07T20:32:42.1222694Z if contiguous: 2025-05-07T20:32:42.1222931Z x0 = x0.contiguous() 2025-05-07T20:32:42.1223201Z x1 = x1.contiguous() 2025-05-07T20:32:42.1223438Z 2025-05-07T20:32:42.1223640Z if scale_ub is not None: 2025-05-07T20:32:42.1223918Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1224257Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1224653Z ) 2025-05-07T20:32:42.1224860Z else: 2025-05-07T20:32:42.1225069Z scale_ub_tensor = None 2025-05-07T20:32:42.1225330Z 2025-05-07T20:32:42.1225566Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1225900Z op = silu_mul_quant 2025-05-07T20:32:42.1233256Z if compiled: 2025-05-07T20:32:42.1233538Z op = torch.compile(op) 2025-05-07T20:32:42.1233856Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1234148Z 2025-05-07T20:32:42.1234360Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1234531Z 2025-05-07T20:32:42.1234648Z moe/activation_test.py:117: 2025-05-07T20:32:42.1234951Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1235295Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1235593Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1236419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:42.1237132Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1237685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1238393Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1239063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1239661Z kernel = self.compile( 2025-05-07T20:32:42.1240218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1240896Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1241300Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1241538Z 2025-05-07T20:32:42.1241755Z self = 2025-05-07T20:32:42.1242844Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1244240Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e7df420>} 2025-05-07T20:32:42.1245636Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1246671Z context = 2025-05-07T20:32:42.1246974Z 2025-05-07T20:32:42.1247160Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1247847Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1248318Z module_map=module_map) 2025-05-07T20:32:42.1248702Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1249070Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1249334Z E ^ 2025-05-07T20:32:42.1249808Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1250269Z 2025-05-07T20:32:42.1250693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1251212Z 2025-05-07T20:32:42.1251320Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1251748Z self=, 2025-05-07T20:32:42.1252159Z T=4096, 2025-05-07T20:32:42.1252367Z D=5120, 2025-05-07T20:32:42.1252626Z scale_ub=1200.0, 2025-05-07T20:32:42.1252865Z contiguous=False, 2025-05-07T20:32:42.1253106Z compiled=True, 2025-05-07T20:32:42.1253328Z ) 2025-05-07T20:32:42.1253653Z self = 2025-05-07T20:32:42.1254166Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.1254452Z 2025-05-07T20:32:42.1254536Z @given( 2025-05-07T20:32:42.1254784Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1255111Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1255437Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1255780Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1256113Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1256415Z ) 2025-05-07T20:32:42.1256780Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1257233Z def test_silu_mul_quant( 2025-05-07T20:32:42.1257564Z self, 2025-05-07T20:32:42.1257777Z T: int, 2025-05-07T20:32:42.1257990Z D: int, 2025-05-07T20:32:42.1258215Z scale_ub: Optional[float], 2025-05-07T20:32:42.1258497Z contiguous: bool, 2025-05-07T20:32:42.1258750Z compiled: bool, 2025-05-07T20:32:42.1258981Z ) -> None: 2025-05-07T20:32:42.1259211Z torch.manual_seed(2025) 2025-05-07T20:32:42.1259467Z 2025-05-07T20:32:42.1259795Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1260157Z 2025-05-07T20:32:42.1260369Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1260667Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1260993Z x = x_sign * x_clamp 2025-05-07T20:32:42.1261251Z x0 = x[:, :D] 2025-05-07T20:32:42.1261475Z x1 = x[:, D:] 2025-05-07T20:32:42.1261696Z 2025-05-07T20:32:42.1261894Z if contiguous: 2025-05-07T20:32:42.1262135Z x0 = x0.contiguous() 2025-05-07T20:32:42.1262404Z x1 = x1.contiguous() 2025-05-07T20:32:42.1262660Z 2025-05-07T20:32:42.1262866Z if scale_ub is not None: 2025-05-07T20:32:42.1263152Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1263499Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1263819Z ) 2025-05-07T20:32:42.1264022Z else: 2025-05-07T20:32:42.1264244Z scale_ub_tensor = None 2025-05-07T20:32:42.1264568Z 2025-05-07T20:32:42.1264808Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1265135Z op = silu_mul_quant 2025-05-07T20:32:42.1265401Z if compiled: 2025-05-07T20:32:42.1265665Z op = torch.compile(op) 2025-05-07T20:32:42.1265977Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1266266Z 2025-05-07T20:32:42.1266467Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1266644Z 2025-05-07T20:32:42.1266753Z moe/activation_test.py:117: 2025-05-07T20:32:42.1267067Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1267418Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1267707Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1268276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.1268851Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.1269519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1270221Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1270772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1271470Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1272186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1272731Z kernel = self.compile( 2025-05-07T20:32:42.1273282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1273939Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1274349Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1274594Z 2025-05-07T20:32:42.1274806Z self = 2025-05-07T20:32:42.1275896Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1277313Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3f134860>} 2025-05-07T20:32:42.1278717Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1279752Z context = 2025-05-07T20:32:42.1280043Z 2025-05-07T20:32:42.1280223Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1280807Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1281278Z module_map=module_map) 2025-05-07T20:32:42.1281655Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1282022Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1282287Z E ^ 2025-05-07T20:32:42.1282766Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1283220Z 2025-05-07T20:32:42.1283646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1284158Z 2025-05-07T20:32:42.2152957Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2154225Z self=, 2025-05-07T20:32:42.2155312Z T=2048, 2025-05-07T20:32:42.2156065Z D=7168, 2025-05-07T20:32:42.2156455Z scale_ub=1200.0, 2025-05-07T20:32:42.2156901Z contiguous=False, 2025-05-07T20:32:42.2157305Z compiled=False, 2025-05-07T20:32:42.2157542Z ) 2025-05-07T20:32:42.2157893Z self = 2025-05-07T20:32:42.2158394Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.2158681Z 2025-05-07T20:32:42.2158767Z @given( 2025-05-07T20:32:42.2159013Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2159332Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2159649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2159981Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2160320Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2160616Z ) 2025-05-07T20:32:42.2160966Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2161417Z def test_silu_mul_quant( 2025-05-07T20:32:42.2161664Z self, 2025-05-07T20:32:42.2161858Z T: int, 2025-05-07T20:32:42.2162062Z D: int, 2025-05-07T20:32:42.2162284Z scale_ub: Optional[float], 2025-05-07T20:32:42.2162553Z contiguous: bool, 2025-05-07T20:32:42.2162798Z compiled: bool, 2025-05-07T20:32:42.2163026Z ) -> None: 2025-05-07T20:32:42.2163240Z torch.manual_seed(2025) 2025-05-07T20:32:42.2163487Z 2025-05-07T20:32:42.2163844Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2164189Z 2025-05-07T20:32:42.2164391Z x_sign = torch.sign(x) 2025-05-07T20:32:42.2164690Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.2165001Z x = x_sign * x_clamp 2025-05-07T20:32:42.2165240Z x0 = x[:, :D] 2025-05-07T20:32:42.2165464Z x1 = x[:, D:] 2025-05-07T20:32:42.2165680Z 2025-05-07T20:32:42.2165870Z if contiguous: 2025-05-07T20:32:42.2166108Z x0 = x0.contiguous() 2025-05-07T20:32:42.2166367Z x1 = x1.contiguous() 2025-05-07T20:32:42.2166605Z 2025-05-07T20:32:42.2166801Z if scale_ub is not None: 2025-05-07T20:32:42.2167075Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.2167409Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.2167825Z ) 2025-05-07T20:32:42.2168029Z else: 2025-05-07T20:32:42.2168245Z scale_ub_tensor = None 2025-05-07T20:32:42.2168576Z 2025-05-07T20:32:42.2168814Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.2169124Z op = silu_mul_quant 2025-05-07T20:32:42.2169379Z if compiled: 2025-05-07T20:32:42.2169634Z op = torch.compile(op) 2025-05-07T20:32:42.2169928Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.2170212Z 2025-05-07T20:32:42.2170411Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.2170650Z 2025-05-07T20:32:42.2170757Z moe/activation_test.py:117: 2025-05-07T20:32:42.2171056Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.2171394Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.2171679Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.2172369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:42.2173073Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.2173616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.2174311Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.2174976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.2175529Z kernel = self.compile( 2025-05-07T20:32:42.2176129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.2176782Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.2177189Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.2177425Z 2025-05-07T20:32:42.2177633Z self = 2025-05-07T20:32:42.2178718Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.2180105Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3f1356c0>} 2025-05-07T20:32:42.2181434Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.2182458Z context = 2025-05-07T20:32:42.2182748Z 2025-05-07T20:32:42.2182912Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.2183433Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.2183935Z module_map=module_map) 2025-05-07T20:32:42.2184298Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.2184652Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.2184903Z E ^ 2025-05-07T20:32:42.2185367Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.2185818Z 2025-05-07T20:32:42.2186229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.2186743Z 2025-05-07T20:32:42.2186850Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2187258Z self=, 2025-05-07T20:32:42.2187662Z T=1, 2025-05-07T20:32:42.2187847Z D=7168, 2025-05-07T20:32:42.2188035Z scale_ub=None, 2025-05-07T20:32:42.2188249Z contiguous=True, 2025-05-07T20:32:42.2188470Z compiled=False, 2025-05-07T20:32:42.2188673Z ) 2025-05-07T20:32:42.2189040Z self = 2025-05-07T20:32:42.2189522Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.2189777Z 2025-05-07T20:32:42.2189860Z @given( 2025-05-07T20:32:42.2190083Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2190394Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2190739Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2191065Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2191394Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2191681Z ) 2025-05-07T20:32:42.2192019Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2192459Z def test_silu_mul_quant( 2025-05-07T20:32:42.2192699Z self, 2025-05-07T20:32:42.2192888Z T: int, 2025-05-07T20:32:42.2193089Z D: int, 2025-05-07T20:32:42.2193310Z scale_ub: Optional[float], 2025-05-07T20:32:42.2193583Z contiguous: bool, 2025-05-07T20:32:42.2193816Z compiled: bool, 2025-05-07T20:32:42.2194046Z ) -> None: 2025-05-07T20:32:42.2194268Z torch.manual_seed(2025) 2025-05-07T20:32:42.2194508Z 2025-05-07T20:32:42.2194787Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2195135Z 2025-05-07T20:32:42.2195336Z x_sign = torch.sign(x) 2025-05-07T20:32:42.2195682Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.2195996Z x = x_sign * x_clamp 2025-05-07T20:32:42.2196236Z x0 = x[:, :D] 2025-05-07T20:32:42.2196460Z x1 = x[:, D:] 2025-05-07T20:32:42.2196676Z 2025-05-07T20:32:42.2196862Z if contiguous: 2025-05-07T20:32:42.2197098Z x0 = x0.contiguous() 2025-05-07T20:32:42.2197361Z x1 = x1.contiguous() 2025-05-07T20:32:42.2197626Z 2025-05-07T20:32:42.2197855Z if scale_ub is not None: 2025-05-07T20:32:42.2198132Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.2198464Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.2198779Z ) 2025-05-07T20:32:42.2198981Z else: 2025-05-07T20:32:42.2199196Z scale_ub_tensor = None 2025-05-07T20:32:42.2199449Z 2025-05-07T20:32:42.2199685Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.2200007Z op = silu_mul_quant 2025-05-07T20:32:42.2200257Z if compiled: 2025-05-07T20:32:42.2200510Z op = torch.compile(op) 2025-05-07T20:32:42.2200807Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.2201083Z 2025-05-07T20:32:42.2201282Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.2201447Z 2025-05-07T20:32:42.2201552Z moe/activation_test.py:117: 2025-05-07T20:32:42.2201848Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.2202258Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.2202545Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.2203235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.2203931Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.2204469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.2205162Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.2206184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.2206880Z kernel = self.compile( 2025-05-07T20:32:42.2207426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.2208136Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.2208629Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.2208869Z 2025-05-07T20:32:42.2209080Z self = 2025-05-07T20:32:42.2210160Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.2211593Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3f134fe0>} 2025-05-07T20:32:42.2212931Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.2213962Z context = 2025-05-07T20:32:42.2214259Z 2025-05-07T20:32:42.2214424Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.2214951Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.2215413Z module_map=module_map) 2025-05-07T20:32:42.2215785Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.2216264Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.2216529Z E ^ 2025-05-07T20:32:42.2216994Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.2217449Z 2025-05-07T20:32:42.2217866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.2218375Z 2025-05-07T20:32:42.2218488Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2218915Z self=, 2025-05-07T20:32:42.2219317Z T=16384, 2025-05-07T20:32:42.2219520Z D=7168, 2025-05-07T20:32:42.2219719Z scale_ub=1200.0, 2025-05-07T20:32:42.2219939Z contiguous=False, 2025-05-07T20:32:42.2220169Z compiled=True, 2025-05-07T20:32:42.5681073Z ) 2025-05-07T20:32:42.5681635Z self = 2025-05-07T20:32:42.5682364Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.5682772Z 2025-05-07T20:32:42.5682864Z @given( 2025-05-07T20:32:42.5683103Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5683419Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5683719Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5684054Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5684663Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5684950Z ) 2025-05-07T20:32:42.5685304Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5685739Z def test_silu_mul_quant( 2025-05-07T20:32:42.5685974Z self, 2025-05-07T20:32:42.5686169Z T: int, 2025-05-07T20:32:42.5686368Z D: int, 2025-05-07T20:32:42.5686580Z scale_ub: Optional[float], 2025-05-07T20:32:42.5686860Z contiguous: bool, 2025-05-07T20:32:42.5687104Z compiled: bool, 2025-05-07T20:32:42.5687326Z ) -> None: 2025-05-07T20:32:42.5687632Z torch.manual_seed(2025) 2025-05-07T20:32:42.5687884Z 2025-05-07T20:32:42.5688162Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5688502Z 2025-05-07T20:32:42.5688702Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5689002Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5689308Z x = x_sign * x_clamp 2025-05-07T20:32:42.5689559Z x0 = x[:, :D] 2025-05-07T20:32:42.5689869Z x1 = x[:, D:] 2025-05-07T20:32:42.5690077Z 2025-05-07T20:32:42.5690274Z if contiguous: 2025-05-07T20:32:42.5690511Z x0 = x0.contiguous() 2025-05-07T20:32:42.5690775Z x1 = x1.contiguous() 2025-05-07T20:32:42.5691018Z 2025-05-07T20:32:42.5691215Z if scale_ub is not None: 2025-05-07T20:32:42.5691486Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5691907Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5692220Z ) 2025-05-07T20:32:42.5692422Z else: 2025-05-07T20:32:42.5692639Z scale_ub_tensor = None 2025-05-07T20:32:42.5692902Z 2025-05-07T20:32:42.5693143Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5693456Z op = silu_mul_quant 2025-05-07T20:32:42.5693714Z if compiled: 2025-05-07T20:32:42.5693968Z op = torch.compile(op) 2025-05-07T20:32:42.5694269Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5694551Z 2025-05-07T20:32:42.5694751Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5694937Z 2025-05-07T20:32:42.5695044Z moe/activation_test.py:117: 2025-05-07T20:32:42.5695340Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5695683Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5695973Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5696629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.5697199Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.5697869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5698569Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5699110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5699800Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5700465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5701003Z kernel = self.compile( 2025-05-07T20:32:42.5701543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5702212Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5702615Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5702846Z 2025-05-07T20:32:42.5703054Z self = 2025-05-07T20:32:42.5704187Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5705857Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3f137b00>} 2025-05-07T20:32:42.5707215Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5708252Z context = 2025-05-07T20:32:42.5708544Z 2025-05-07T20:32:42.5708710Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5709237Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5709707Z module_map=module_map) 2025-05-07T20:32:42.5710079Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5710504Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5710772Z E ^ 2025-05-07T20:32:42.5711242Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5711693Z 2025-05-07T20:32:42.5712112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5712692Z 2025-05-07T20:32:42.5712802Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5713220Z self=, 2025-05-07T20:32:42.5713628Z T=1, 2025-05-07T20:32:42.5713812Z D=7168, 2025-05-07T20:32:42.5714013Z scale_ub=None, 2025-05-07T20:32:42.5714237Z contiguous=False, 2025-05-07T20:32:42.5714464Z compiled=False, 2025-05-07T20:32:42.5721475Z ) 2025-05-07T20:32:42.5721852Z self = 2025-05-07T20:32:42.5722368Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.5722649Z 2025-05-07T20:32:42.5722733Z @given( 2025-05-07T20:32:42.5722982Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5723303Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5723624Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5723969Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5724422Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5724722Z ) 2025-05-07T20:32:42.5725085Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5725539Z def test_silu_mul_quant( 2025-05-07T20:32:42.5725786Z self, 2025-05-07T20:32:42.5725993Z T: int, 2025-05-07T20:32:42.5726206Z D: int, 2025-05-07T20:32:42.5726429Z scale_ub: Optional[float], 2025-05-07T20:32:42.5726714Z contiguous: bool, 2025-05-07T20:32:42.5726973Z compiled: bool, 2025-05-07T20:32:42.5727204Z ) -> None: 2025-05-07T20:32:42.5727447Z torch.manual_seed(2025) 2025-05-07T20:32:42.5727828Z 2025-05-07T20:32:42.5728109Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5728469Z 2025-05-07T20:32:42.5728683Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5728982Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5729311Z x = x_sign * x_clamp 2025-05-07T20:32:42.5729567Z x0 = x[:, :D] 2025-05-07T20:32:42.5729798Z x1 = x[:, D:] 2025-05-07T20:32:42.5730013Z 2025-05-07T20:32:42.5730213Z if contiguous: 2025-05-07T20:32:42.5730459Z x0 = x0.contiguous() 2025-05-07T20:32:42.5730725Z x1 = x1.contiguous() 2025-05-07T20:32:42.5730980Z 2025-05-07T20:32:42.5731192Z if scale_ub is not None: 2025-05-07T20:32:42.5731481Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5731905Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5732229Z ) 2025-05-07T20:32:42.5732430Z else: 2025-05-07T20:32:42.5732656Z scale_ub_tensor = None 2025-05-07T20:32:42.5732922Z 2025-05-07T20:32:42.5733162Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5733492Z op = silu_mul_quant 2025-05-07T20:32:42.5733759Z if compiled: 2025-05-07T20:32:42.5734018Z op = torch.compile(op) 2025-05-07T20:32:42.5734335Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5734625Z 2025-05-07T20:32:42.5734828Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5735007Z 2025-05-07T20:32:42.5735113Z moe/activation_test.py:117: 2025-05-07T20:32:42.5735425Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5735775Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5736064Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5736828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5737538Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5738081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5738779Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5739500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5740047Z kernel = self.compile( 2025-05-07T20:32:42.5740593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5741264Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5741673Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5741909Z 2025-05-07T20:32:42.5742134Z self = 2025-05-07T20:32:42.5743220Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5744608Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e86c9a0>} 2025-05-07T20:32:42.5746015Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5747058Z context = 2025-05-07T20:32:42.5747348Z 2025-05-07T20:32:42.5747532Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5748061Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5748542Z module_map=module_map) 2025-05-07T20:32:42.5748919Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5749277Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5749550Z E ^ 2025-05-07T20:32:42.5750035Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5750488Z 2025-05-07T20:32:42.5750917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5751432Z 2025-05-07T20:32:42.5751541Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5751967Z self=, 2025-05-07T20:32:42.5752432Z T=2048, 2025-05-07T20:32:42.5752637Z D=7168, 2025-05-07T20:32:42.5752852Z scale_ub=None, 2025-05-07T20:32:42.5753089Z contiguous=False, 2025-05-07T20:32:42.5753326Z compiled=True, 2025-05-07T20:32:42.5753547Z ) 2025-05-07T20:32:42.6437428Z self = 2025-05-07T20:32:42.6438237Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.6438623Z 2025-05-07T20:32:42.6438767Z @given( 2025-05-07T20:32:42.6439083Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6439500Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6439822Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6440161Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6440503Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6440803Z ) 2025-05-07T20:32:42.6441163Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6441875Z def test_silu_mul_quant( 2025-05-07T20:32:42.6442140Z self, 2025-05-07T20:32:42.6442343Z T: int, 2025-05-07T20:32:42.6442563Z D: int, 2025-05-07T20:32:42.6442794Z scale_ub: Optional[float], 2025-05-07T20:32:42.6443078Z contiguous: bool, 2025-05-07T20:32:42.6443326Z compiled: bool, 2025-05-07T20:32:42.6443564Z ) -> None: 2025-05-07T20:32:42.6443794Z torch.manual_seed(2025) 2025-05-07T20:32:42.6444180Z 2025-05-07T20:32:42.6444468Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6444824Z 2025-05-07T20:32:42.6445024Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6445328Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6445653Z x = x_sign * x_clamp 2025-05-07T20:32:42.6445898Z x0 = x[:, :D] 2025-05-07T20:32:42.6446135Z x1 = x[:, D:] 2025-05-07T20:32:42.6446358Z 2025-05-07T20:32:42.6446552Z if contiguous: 2025-05-07T20:32:42.6446796Z x0 = x0.contiguous() 2025-05-07T20:32:42.6447066Z x1 = x1.contiguous() 2025-05-07T20:32:42.6447311Z 2025-05-07T20:32:42.6447519Z if scale_ub is not None: 2025-05-07T20:32:42.6447957Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6448297Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6448615Z ) 2025-05-07T20:32:42.6448917Z else: 2025-05-07T20:32:42.6449144Z scale_ub_tensor = None 2025-05-07T20:32:42.6449400Z 2025-05-07T20:32:42.6449646Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6449972Z op = silu_mul_quant 2025-05-07T20:32:42.6450231Z if compiled: 2025-05-07T20:32:42.6450495Z op = torch.compile(op) 2025-05-07T20:32:42.6450807Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6451086Z 2025-05-07T20:32:42.6451297Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6451470Z 2025-05-07T20:32:42.6451575Z moe/activation_test.py:117: 2025-05-07T20:32:42.6451881Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6452217Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6452509Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6453082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6453662Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6454322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6455018Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6455566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6456322Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6456995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6457533Z kernel = self.compile( 2025-05-07T20:32:42.6458078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6458731Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6459136Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6459371Z 2025-05-07T20:32:42.6459586Z self = 2025-05-07T20:32:42.6460667Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6462102Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e86dd00>} 2025-05-07T20:32:42.6463455Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6464483Z context = 2025-05-07T20:32:42.6464818Z 2025-05-07T20:32:42.6464993Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6465515Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6465991Z module_map=module_map) 2025-05-07T20:32:42.6466362Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6466723Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6466987Z E ^ 2025-05-07T20:32:42.6467484Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6467941Z 2025-05-07T20:32:42.6468361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6468873Z 2025-05-07T20:32:42.6468987Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6469406Z self=, 2025-05-07T20:32:42.6469862Z T=4096, 2025-05-07T20:32:42.6470061Z D=7168, 2025-05-07T20:32:42.6470264Z scale_ub=None, 2025-05-07T20:32:42.6470486Z contiguous=False, 2025-05-07T20:32:42.6470722Z compiled=True, 2025-05-07T20:32:42.6470938Z ) 2025-05-07T20:32:42.6471261Z self = 2025-05-07T20:32:42.6471765Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.6472038Z 2025-05-07T20:32:42.6472131Z @given( 2025-05-07T20:32:42.6472370Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6472694Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6473010Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6473350Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6473685Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6473982Z ) 2025-05-07T20:32:42.6474349Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6474793Z def test_silu_mul_quant( 2025-05-07T20:32:42.6475045Z self, 2025-05-07T20:32:42.6475250Z T: int, 2025-05-07T20:32:42.6475450Z D: int, 2025-05-07T20:32:42.6475675Z scale_ub: Optional[float], 2025-05-07T20:32:42.6475951Z contiguous: bool, 2025-05-07T20:32:42.6476189Z compiled: bool, 2025-05-07T20:32:42.6476418Z ) -> None: 2025-05-07T20:32:42.6476686Z torch.manual_seed(2025) 2025-05-07T20:32:42.6476933Z 2025-05-07T20:32:42.6477208Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6477556Z 2025-05-07T20:32:42.6477748Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6478042Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6478355Z x = x_sign * x_clamp 2025-05-07T20:32:42.6478600Z x0 = x[:, :D] 2025-05-07T20:32:42.6478813Z x1 = x[:, D:] 2025-05-07T20:32:42.6479033Z 2025-05-07T20:32:42.6479220Z if contiguous: 2025-05-07T20:32:42.6479448Z x0 = x0.contiguous() 2025-05-07T20:32:42.6479710Z x1 = x1.contiguous() 2025-05-07T20:32:42.6479956Z 2025-05-07T20:32:42.6480147Z if scale_ub is not None: 2025-05-07T20:32:42.6480424Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6480757Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6481062Z ) 2025-05-07T20:32:42.6481264Z else: 2025-05-07T20:32:42.6481524Z scale_ub_tensor = None 2025-05-07T20:32:42.6481776Z 2025-05-07T20:32:42.6482010Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6482328Z op = silu_mul_quant 2025-05-07T20:32:42.6482577Z if compiled: 2025-05-07T20:32:42.6482829Z op = torch.compile(op) 2025-05-07T20:32:42.6483129Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6483451Z 2025-05-07T20:32:42.6483646Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6483819Z 2025-05-07T20:32:42.6483921Z moe/activation_test.py:117: 2025-05-07T20:32:42.6484220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6484552Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6484839Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6485399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6485962Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6486625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6487320Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6487929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6488608Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6489321Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6489864Z kernel = self.compile( 2025-05-07T20:32:42.6490399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6491059Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6491463Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6491691Z 2025-05-07T20:32:42.6491907Z self = 2025-05-07T20:32:42.6492979Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6494359Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e86e840>} 2025-05-07T20:32:42.6495695Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6496716Z context = 2025-05-07T20:32:42.6497047Z 2025-05-07T20:32:42.6497223Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6497741Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6498209Z module_map=module_map) 2025-05-07T20:32:42.6498578Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6498928Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6499195Z E ^ 2025-05-07T20:32:42.6499660Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6500106Z 2025-05-07T20:32:42.6500525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6501033Z 2025-05-07T20:32:42.7764021Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7765245Z self=, 2025-05-07T20:32:42.7766749Z T=16384, 2025-05-07T20:32:42.7767158Z D=5120, 2025-05-07T20:32:42.7767497Z scale_ub=1200.0, 2025-05-07T20:32:42.7767867Z contiguous=False, 2025-05-07T20:32:42.7768093Z compiled=False, 2025-05-07T20:32:42.7768302Z ) 2025-05-07T20:32:42.7768620Z self = 2025-05-07T20:32:42.7769126Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.7769494Z 2025-05-07T20:32:42.7769584Z @given( 2025-05-07T20:32:42.7769812Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7770128Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7770440Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7770765Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7771096Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7771388Z ) 2025-05-07T20:32:42.7771747Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7772191Z def test_silu_mul_quant( 2025-05-07T20:32:42.7772437Z self, 2025-05-07T20:32:42.7772637Z T: int, 2025-05-07T20:32:42.7772834Z D: int, 2025-05-07T20:32:42.7773056Z scale_ub: Optional[float], 2025-05-07T20:32:42.7773335Z contiguous: bool, 2025-05-07T20:32:42.7773574Z compiled: bool, 2025-05-07T20:32:42.7773894Z ) -> None: 2025-05-07T20:32:42.7774116Z torch.manual_seed(2025) 2025-05-07T20:32:42.7774362Z 2025-05-07T20:32:42.7774642Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7774992Z 2025-05-07T20:32:42.7775188Z x_sign = torch.sign(x) 2025-05-07T20:32:42.7775488Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.7775805Z x = x_sign * x_clamp 2025-05-07T20:32:42.7776053Z x0 = x[:, :D] 2025-05-07T20:32:42.7776271Z x1 = x[:, D:] 2025-05-07T20:32:42.7776485Z 2025-05-07T20:32:42.7776679Z if contiguous: 2025-05-07T20:32:42.7776906Z x0 = x0.contiguous() 2025-05-07T20:32:42.7777166Z x1 = x1.contiguous() 2025-05-07T20:32:42.7777412Z 2025-05-07T20:32:42.7777606Z if scale_ub is not None: 2025-05-07T20:32:42.7777928Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.7778270Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.7778582Z ) 2025-05-07T20:32:42.7778783Z else: 2025-05-07T20:32:42.7779001Z scale_ub_tensor = None 2025-05-07T20:32:42.7779257Z 2025-05-07T20:32:42.7779483Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.7779802Z op = silu_mul_quant 2025-05-07T20:32:42.7780057Z if compiled: 2025-05-07T20:32:42.7780307Z op = torch.compile(op) 2025-05-07T20:32:42.7780606Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7780962Z 2025-05-07T20:32:42.7781163Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.7781336Z 2025-05-07T20:32:42.7781440Z moe/activation_test.py:117: 2025-05-07T20:32:42.7781742Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7782077Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.7782356Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7783045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:42.7783744Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.7784281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.7784968Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.7785636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.7786255Z kernel = self.compile( 2025-05-07T20:32:42.7786796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.7787453Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.7787854Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7788082Z 2025-05-07T20:32:42.7788344Z self = 2025-05-07T20:32:42.7789422Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.7790812Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e4f4040>} 2025-05-07T20:32:42.7792161Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.7793186Z context = 2025-05-07T20:32:42.7793478Z 2025-05-07T20:32:42.7793645Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.7794221Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.7794692Z module_map=module_map) 2025-05-07T20:32:42.7795062Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.7795411Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.7795680Z E ^ 2025-05-07T20:32:42.7796148Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.7796601Z 2025-05-07T20:32:42.7797022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.7797538Z 2025-05-07T20:32:42.7797644Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7798061Z self=, 2025-05-07T20:32:42.7798468Z T=16384, 2025-05-07T20:32:42.7798662Z D=5120, 2025-05-07T20:32:42.7798871Z scale_ub=1200.0, 2025-05-07T20:32:42.7799100Z contiguous=True, 2025-05-07T20:32:42.7799321Z compiled=True, 2025-05-07T20:32:42.7799532Z ) 2025-05-07T20:32:42.7799855Z self = 2025-05-07T20:32:42.7800343Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.7800628Z 2025-05-07T20:32:42.7800709Z @given( 2025-05-07T20:32:42.7800944Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7801314Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7801617Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7801969Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7802298Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7802589Z ) 2025-05-07T20:32:42.7802935Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7803386Z def test_silu_mul_quant( 2025-05-07T20:32:42.7803635Z self, 2025-05-07T20:32:42.7803828Z T: int, 2025-05-07T20:32:42.7804034Z D: int, 2025-05-07T20:32:42.7804257Z scale_ub: Optional[float], 2025-05-07T20:32:42.7804521Z contiguous: bool, 2025-05-07T20:32:42.7804763Z compiled: bool, 2025-05-07T20:32:42.7804987Z ) -> None: 2025-05-07T20:32:42.7805199Z torch.manual_seed(2025) 2025-05-07T20:32:42.7805451Z 2025-05-07T20:32:42.7806014Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7806441Z 2025-05-07T20:32:42.7806636Z x_sign = torch.sign(x) 2025-05-07T20:32:42.7806931Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.7807243Z x = x_sign * x_clamp 2025-05-07T20:32:42.7807483Z x0 = x[:, :D] 2025-05-07T20:32:42.7807786Z x1 = x[:, D:] 2025-05-07T20:32:42.7807999Z 2025-05-07T20:32:42.7808190Z if contiguous: 2025-05-07T20:32:42.7808495Z x0 = x0.contiguous() 2025-05-07T20:32:42.7808764Z x1 = x1.contiguous() 2025-05-07T20:32:42.7809001Z 2025-05-07T20:32:42.7809198Z if scale_ub is not None: 2025-05-07T20:32:42.7809477Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.7809811Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.7810123Z ) 2025-05-07T20:32:42.7810323Z else: 2025-05-07T20:32:42.7810534Z scale_ub_tensor = None 2025-05-07T20:32:42.7810793Z 2025-05-07T20:32:42.7811036Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.7811356Z op = silu_mul_quant 2025-05-07T20:32:42.7811605Z if compiled: 2025-05-07T20:32:42.7811858Z op = torch.compile(op) 2025-05-07T20:32:42.7812158Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7812432Z 2025-05-07T20:32:42.7812628Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.7812811Z 2025-05-07T20:32:42.7813002Z moe/activation_test.py:117: 2025-05-07T20:32:42.7813306Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7813636Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.7813921Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7821920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.7822508Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.7823184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.7823889Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.7824441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.7825129Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.7825808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.7826369Z kernel = self.compile( 2025-05-07T20:32:42.7826926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.7827588Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.7828003Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7828238Z 2025-05-07T20:32:42.7828566Z self = 2025-05-07T20:32:42.7829659Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.7831042Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e4f5300>} 2025-05-07T20:32:42.7832404Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.7833442Z context = 2025-05-07T20:32:42.7833737Z 2025-05-07T20:32:42.7833915Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.7834498Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.7834979Z module_map=module_map) 2025-05-07T20:32:42.7835360Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.7835726Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.7835992Z E ^ 2025-05-07T20:32:42.7836468Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.7836968Z 2025-05-07T20:32:42.7837389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.7837910Z 2025-05-07T20:32:43.0805351Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.0806214Z self=, 2025-05-07T20:32:43.0806782Z T=16384, 2025-05-07T20:32:43.0807049Z D=5120, 2025-05-07T20:32:43.0807278Z scale_ub=None, 2025-05-07T20:32:43.0807511Z contiguous=False, 2025-05-07T20:32:43.0807886Z compiled=True, 2025-05-07T20:32:43.0808105Z ) 2025-05-07T20:32:43.0808429Z self = 2025-05-07T20:32:43.0808942Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.0809223Z 2025-05-07T20:32:43.0809313Z @given( 2025-05-07T20:32:43.0809557Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.0810169Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.0810494Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.0810838Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.0811176Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.0811474Z ) 2025-05-07T20:32:43.0811836Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.0812289Z def test_silu_mul_quant( 2025-05-07T20:32:43.0812546Z self, 2025-05-07T20:32:43.0812753Z T: int, 2025-05-07T20:32:43.0812958Z D: int, 2025-05-07T20:32:43.0813190Z scale_ub: Optional[float], 2025-05-07T20:32:43.0813473Z contiguous: bool, 2025-05-07T20:32:43.0813723Z compiled: bool, 2025-05-07T20:32:43.0813964Z ) -> None: 2025-05-07T20:32:43.0814196Z torch.manual_seed(2025) 2025-05-07T20:32:43.0814446Z 2025-05-07T20:32:43.0814741Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.0815099Z 2025-05-07T20:32:43.0815312Z x_sign = torch.sign(x) 2025-05-07T20:32:43.0815609Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.0815938Z x = x_sign * x_clamp 2025-05-07T20:32:43.0816195Z x0 = x[:, :D] 2025-05-07T20:32:43.0816420Z x1 = x[:, D:] 2025-05-07T20:32:43.0816643Z 2025-05-07T20:32:43.0816854Z if contiguous: 2025-05-07T20:32:43.0817179Z x0 = x0.contiguous() 2025-05-07T20:32:43.0817460Z x1 = x1.contiguous() 2025-05-07T20:32:43.0817716Z 2025-05-07T20:32:43.0817917Z if scale_ub is not None: 2025-05-07T20:32:43.0818207Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.0818558Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.0818874Z ) 2025-05-07T20:32:43.0819084Z else: 2025-05-07T20:32:43.0819313Z scale_ub_tensor = None 2025-05-07T20:32:43.0819575Z 2025-05-07T20:32:43.0819815Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.0820143Z op = silu_mul_quant 2025-05-07T20:32:43.0820409Z if compiled: 2025-05-07T20:32:43.0820667Z op = torch.compile(op) 2025-05-07T20:32:43.0820977Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.0821270Z 2025-05-07T20:32:43.0821471Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.0821647Z 2025-05-07T20:32:43.0821755Z moe/activation_test.py:117: 2025-05-07T20:32:43.0822144Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.0822484Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.0822780Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.0823352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.0823925Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.0824671Z [... kernel-launch and Triton compiler frames identical to the traceback above, elided ...]
2025-05-07T20:32:43.0836809Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:43.0837177Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:43.0837443Z E   ^
2025-05-07T20:32:43.0838059Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:43.0838956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
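[Note: this CompilationError is an architecture mismatch, not a defect in the kernel source. The job ran on a g5.4xlarge runner, whose NVIDIA A10G reports CUDA compute capability (8, 6), while Triton accepts the fp8e4nv (float8_e4m3fn) element type only on compute capability 8.9 or newer; hence the error's hint that only 'fp8e4b15' and 'fp8e5' are available here. Below is a minimal, hypothetical capability gate (the helper name and skip message are illustrative, not taken from activation_test.py) that would skip the FP8 cases on such GPUs instead of erroring:

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # fp8e4nv compiles only on compute capability >= (8, 9) (Ada/Hopper);
        # the A10G on this runner reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
    class SiluMulQuantFP8Test(unittest.TestCase):
        ...

Applied to the test class (or per test method via the same decorator), this would turn the repeated hard failures below into skips on pre-Ada hardware.]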
2025-05-07T20:32:43.1583689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.1584393Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.1584936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1585626Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1586293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1586833Z kernel = self.compile( 2025-05-07T20:32:43.1587380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1588093Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1588494Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1588777Z 2025-05-07T20:32:43.1588988Z self = 2025-05-07T20:32:43.1590078Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1591473Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e4f7240>} 2025-05-07T20:32:43.1592861Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1593893Z context = 2025-05-07T20:32:43.1594188Z 2025-05-07T20:32:43.1594360Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1594886Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1595350Z module_map=module_map) 2025-05-07T20:32:43.1595717Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1596077Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.1596338Z E ^ 2025-05-07T20:32:43.1596854Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1597309Z 2025-05-07T20:32:43.1597727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1598239Z 2025-05-07T20:32:43.1598355Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1598766Z self=, 2025-05-07T20:32:43.1599177Z T=2048, 2025-05-07T20:32:43.1599379Z D=5120, 2025-05-07T20:32:43.1599578Z scale_ub=1200.0, 2025-05-07T20:32:43.1599813Z contiguous=False, 2025-05-07T20:32:43.1600049Z compiled=True, 2025-05-07T20:32:43.1600256Z ) 2025-05-07T20:32:43.1600585Z self = 2025-05-07T20:32:43.1601086Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.1601362Z 2025-05-07T20:32:43.1601454Z @given( 2025-05-07T20:32:43.1601687Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1602007Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1602320Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1602648Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1602981Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1603277Z ) 2025-05-07T20:32:43.1603680Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1604130Z def test_silu_mul_quant( 2025-05-07T20:32:43.1604379Z self, 2025-05-07T20:32:43.1604580Z T: int, 2025-05-07T20:32:43.1604781Z D: int, 2025-05-07T20:32:43.1605007Z scale_ub: Optional[float], 2025-05-07T20:32:43.1605281Z contiguous: bool, 2025-05-07T20:32:43.1605521Z compiled: bool, 2025-05-07T20:32:43.1606058Z ) -> None: 2025-05-07T20:32:43.1606281Z torch.manual_seed(2025) 2025-05-07T20:32:43.1606527Z 2025-05-07T20:32:43.1606806Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1607149Z 2025-05-07T20:32:43.1607345Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1607702Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1608008Z x = x_sign * x_clamp 2025-05-07T20:32:43.1608247Z x0 = x[:, :D] 2025-05-07T20:32:43.1608463Z x1 = x[:, D:] 2025-05-07T20:32:43.1608673Z 2025-05-07T20:32:43.1608930Z if contiguous: 2025-05-07T20:32:43.1609165Z x0 = x0.contiguous() 2025-05-07T20:32:43.1609420Z x1 = x1.contiguous() 2025-05-07T20:32:43.1609652Z 2025-05-07T20:32:43.1609847Z if scale_ub is not None: 2025-05-07T20:32:43.1610116Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1610448Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1610747Z ) 2025-05-07T20:32:43.1611007Z else: 2025-05-07T20:32:43.1611216Z scale_ub_tensor = None 2025-05-07T20:32:43.1611461Z 2025-05-07T20:32:43.1611692Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1612007Z op = silu_mul_quant 2025-05-07T20:32:43.1612255Z if compiled: 2025-05-07T20:32:43.1612503Z op = torch.compile(op) 2025-05-07T20:32:43.1612798Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1613067Z 2025-05-07T20:32:43.1613264Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.1613430Z 2025-05-07T20:32:43.1613535Z moe/activation_test.py:117: 2025-05-07T20:32:43.1613829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1614164Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.1614442Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1615002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.1615625Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.1616282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.1616969Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.1617499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1618181Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1618842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1619380Z kernel = self.compile( 2025-05-07T20:32:43.1619919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1620580Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1620988Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1621217Z 2025-05-07T20:32:43.1621433Z self = 2025-05-07T20:32:43.1622507Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1623953Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e50c720>} 2025-05-07T20:32:43.1625299Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1626325Z context = 2025-05-07T20:32:43.1626617Z 2025-05-07T20:32:43.1626788Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1627308Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1627776Z module_map=module_map) 2025-05-07T20:32:43.1628144Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1628499Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.1628764Z E ^ 2025-05-07T20:32:43.1629283Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1629734Z 2025-05-07T20:32:43.1630157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1630666Z 2025-05-07T20:32:43.2961998Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2963190Z self=, 2025-05-07T20:32:43.2964605Z T=4096, 2025-05-07T20:32:43.2964992Z D=5120, 2025-05-07T20:32:43.2965385Z scale_ub=1200.0, 2025-05-07T20:32:43.2965825Z contiguous=True, 2025-05-07T20:32:43.2966270Z compiled=True, 2025-05-07T20:32:43.2966682Z ) 2025-05-07T20:32:43.2967309Z self = 2025-05-07T20:32:43.2968073Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.2968344Z 2025-05-07T20:32:43.2968436Z @given( 2025-05-07T20:32:43.2968672Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2968980Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2969295Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2969628Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2969956Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2970249Z ) 2025-05-07T20:32:43.2970702Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2971137Z def test_silu_mul_quant( 2025-05-07T20:32:43.2971390Z self, 2025-05-07T20:32:43.2971598Z T: int, 2025-05-07T20:32:43.2971794Z D: int, 2025-05-07T20:32:43.2972020Z scale_ub: Optional[float], 2025-05-07T20:32:43.2972299Z contiguous: bool, 2025-05-07T20:32:43.2972543Z compiled: bool, 2025-05-07T20:32:43.2972768Z ) -> None: 2025-05-07T20:32:43.2972994Z torch.manual_seed(2025) 2025-05-07T20:32:43.2973246Z 2025-05-07T20:32:43.2973520Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2973867Z 2025-05-07T20:32:43.2974072Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2974366Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2974683Z x = x_sign * x_clamp 2025-05-07T20:32:43.2974933Z x0 = x[:, :D] 2025-05-07T20:32:43.2975153Z x1 = x[:, D:] 2025-05-07T20:32:43.2975371Z 2025-05-07T20:32:43.2975566Z if contiguous: 2025-05-07T20:32:43.2975799Z x0 = x0.contiguous() 2025-05-07T20:32:43.2976067Z x1 = x1.contiguous() 2025-05-07T20:32:43.2976313Z 2025-05-07T20:32:43.2976505Z if scale_ub is not None: 2025-05-07T20:32:43.2976782Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2977122Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2977434Z ) 2025-05-07T20:32:43.2977725Z else: 2025-05-07T20:32:43.2977991Z scale_ub_tensor = None 2025-05-07T20:32:43.2978248Z 2025-05-07T20:32:43.2978478Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2978800Z op = silu_mul_quant 2025-05-07T20:32:43.2979050Z if compiled: 2025-05-07T20:32:43.2979297Z op = torch.compile(op) 2025-05-07T20:32:43.2979597Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2979879Z 2025-05-07T20:32:43.2980080Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2980245Z 2025-05-07T20:32:43.2980346Z moe/activation_test.py:117: 2025-05-07T20:32:43.2980643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2980980Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2981257Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2981821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2982465Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2983125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2983817Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2984353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2985103Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2985764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2986300Z kernel = self.compile( 2025-05-07T20:32:43.2986849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2987512Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2987913Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2988148Z 2025-05-07T20:32:43.2988357Z self = 2025-05-07T20:32:43.2989435Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2990869Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e50d260>} 2025-05-07T20:32:43.2992207Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2993235Z context = 2025-05-07T20:32:43.2993531Z 2025-05-07T20:32:43.2993696Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2994224Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2994686Z module_map=module_map) 2025-05-07T20:32:43.2995056Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2995417Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2995682Z E ^ 2025-05-07T20:32:43.2996153Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2996609Z 2025-05-07T20:32:43.2997023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2997533Z 2025-05-07T20:32:43.2997648Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2998105Z self=, 2025-05-07T20:32:43.2998512Z T=128, 2025-05-07T20:32:43.2998703Z D=5120, 2025-05-07T20:32:43.2998918Z scale_ub=1200.0, 2025-05-07T20:32:43.2999143Z contiguous=False, 2025-05-07T20:32:43.2999377Z compiled=True, 2025-05-07T20:32:43.2999588Z ) 2025-05-07T20:32:43.5480684Z self = 2025-05-07T20:32:43.5481442Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.5481903Z 2025-05-07T20:32:43.5482016Z @given( 2025-05-07T20:32:43.5482345Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.5482737Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.5483052Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.5483387Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.5483715Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.5484005Z ) 2025-05-07T20:32:43.5484653Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.5485096Z def test_silu_mul_quant( 2025-05-07T20:32:43.5485345Z self, 2025-05-07T20:32:43.5485546Z T: int, 2025-05-07T20:32:43.5485750Z D: int, 2025-05-07T20:32:43.5485966Z scale_ub: Optional[float], 2025-05-07T20:32:43.5486242Z contiguous: bool, 2025-05-07T20:32:43.5486484Z compiled: bool, 2025-05-07T20:32:43.5486795Z ) -> None: 2025-05-07T20:32:43.5487020Z torch.manual_seed(2025) 2025-05-07T20:32:43.5487272Z 2025-05-07T20:32:43.5487622Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.5487982Z 2025-05-07T20:32:43.5488187Z x_sign = torch.sign(x) 2025-05-07T20:32:43.5488486Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.5488808Z x = x_sign * x_clamp 2025-05-07T20:32:43.5489059Z x0 = x[:, :D] 2025-05-07T20:32:43.5489292Z x1 = x[:, D:] 2025-05-07T20:32:43.5489535Z 2025-05-07T20:32:43.5489718Z if contiguous: 2025-05-07T20:32:43.5489956Z x0 = x0.contiguous() 2025-05-07T20:32:43.5490219Z x1 = x1.contiguous() 2025-05-07T20:32:43.5490459Z 2025-05-07T20:32:43.5490653Z if scale_ub is not None: 2025-05-07T20:32:43.5490929Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.5491269Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.5491666Z ) 2025-05-07T20:32:43.5491862Z else: 2025-05-07T20:32:43.5492080Z scale_ub_tensor = None 2025-05-07T20:32:43.5492333Z 2025-05-07T20:32:43.5492566Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.5492881Z op = silu_mul_quant 2025-05-07T20:32:43.5493132Z if compiled: 2025-05-07T20:32:43.5493385Z op = torch.compile(op) 2025-05-07T20:32:43.5493685Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5493959Z 2025-05-07T20:32:43.5494160Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.5494322Z 2025-05-07T20:32:43.5494433Z moe/activation_test.py:117: 2025-05-07T20:32:43.5494727Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5495067Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.5495351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5495914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.5496481Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.5497142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.5497860Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.5498416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.5499207Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.5499874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.5500409Z kernel = self.compile( 2025-05-07T20:32:43.5500946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.5501604Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.5502009Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5502235Z 2025-05-07T20:32:43.5502448Z self = 2025-05-07T20:32:43.5503522Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.5504964Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e50e480>} 2025-05-07T20:32:43.5506594Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.5507698Z context = 2025-05-07T20:32:43.5507986Z 2025-05-07T20:32:43.5508156Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.5508686Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.5509161Z module_map=module_map) 2025-05-07T20:32:43.5509542Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.5509908Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.5510184Z E ^ 2025-05-07T20:32:43.5510662Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.5511113Z 2025-05-07T20:32:43.5511541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.5512054Z 2025-05-07T20:32:43.5512164Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.5512658Z self=, 2025-05-07T20:32:43.5513078Z T=16384, 2025-05-07T20:32:43.5513282Z D=7168, 2025-05-07T20:32:43.5513489Z scale_ub=1200.0, 2025-05-07T20:32:43.5513726Z contiguous=True, 2025-05-07T20:32:43.5513953Z compiled=True, 2025-05-07T20:32:43.5514176Z ) 2025-05-07T20:32:43.5514506Z self = 2025-05-07T20:32:43.5515008Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.5515297Z 2025-05-07T20:32:43.5515380Z @given( 2025-05-07T20:32:43.5515623Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.5515947Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.5516256Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.5516599Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.5516939Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.5517236Z ) 2025-05-07T20:32:43.5517593Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.5518048Z def test_silu_mul_quant( 2025-05-07T20:32:43.5518294Z self, 2025-05-07T20:32:43.5518504Z T: int, 2025-05-07T20:32:43.5518713Z D: int, 2025-05-07T20:32:43.5518933Z scale_ub: Optional[float], 2025-05-07T20:32:43.5519213Z contiguous: bool, 2025-05-07T20:32:43.5519464Z compiled: bool, 2025-05-07T20:32:43.5519767Z ) -> None: 2025-05-07T20:32:43.5519993Z torch.manual_seed(2025) 2025-05-07T20:32:43.5520247Z 2025-05-07T20:32:43.5520529Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.5520868Z 2025-05-07T20:32:43.5521070Z x_sign = torch.sign(x) 2025-05-07T20:32:43.5521370Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.5521684Z x = x_sign * x_clamp 2025-05-07T20:32:43.5521942Z x0 = x[:, :D] 2025-05-07T20:32:43.5522170Z x1 = x[:, D:] 2025-05-07T20:32:43.5522381Z 2025-05-07T20:32:43.5522580Z if contiguous: 2025-05-07T20:32:43.5522822Z x0 = x0.contiguous() 2025-05-07T20:32:43.5523081Z x1 = x1.contiguous() 2025-05-07T20:32:43.5523330Z 2025-05-07T20:32:43.5523536Z if scale_ub is not None: 2025-05-07T20:32:43.5523812Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.5524165Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.5524551Z ) 2025-05-07T20:32:43.5524760Z else: 2025-05-07T20:32:43.5524976Z scale_ub_tensor = None 2025-05-07T20:32:43.5525239Z 2025-05-07T20:32:43.5525478Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.5525800Z op = silu_mul_quant 2025-05-07T20:32:43.5526051Z if compiled: 2025-05-07T20:32:43.5526308Z op = torch.compile(op) 2025-05-07T20:32:43.5526659Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5526935Z 2025-05-07T20:32:43.5527135Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.5527301Z 2025-05-07T20:32:43.5527410Z moe/activation_test.py:117: 2025-05-07T20:32:43.5527805Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5528168Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.5528457Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5529020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.5529586Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.5530254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.5530948Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.5531484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.5532223Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.5532891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.5533428Z kernel = self.compile( 2025-05-07T20:32:43.5533968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.5534626Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.5535032Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5535263Z 2025-05-07T20:32:43.5535473Z self = 2025-05-07T20:32:43.5536560Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.5537948Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e50fd80>} 2025-05-07T20:32:43.5539301Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.5540381Z context = 2025-05-07T20:32:43.5540673Z 2025-05-07T20:32:43.5540840Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.5541368Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.5541840Z module_map=module_map) 2025-05-07T20:32:43.5542213Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.5542573Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.5542838Z E ^ 2025-05-07T20:32:43.5543311Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.5543761Z 2025-05-07T20:32:43.5544181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.5544702Z 2025-05-07T20:32:43.6500391Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.6501307Z self=, 2025-05-07T20:32:43.6501881Z T=16384, 2025-05-07T20:32:43.6502122Z D=5120, 2025-05-07T20:32:43.6502317Z scale_ub=1200.0, 2025-05-07T20:32:43.6502545Z contiguous=True, 2025-05-07T20:32:43.6502772Z compiled=False, 2025-05-07T20:32:43.6502980Z ) 2025-05-07T20:32:43.6503303Z self = 2025-05-07T20:32:43.6503895Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.6504175Z 2025-05-07T20:32:43.6504258Z @given( 2025-05-07T20:32:43.6504495Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.6504811Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.6505128Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.6505457Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.6506042Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.6507293Z ) 2025-05-07T20:32:43.6507709Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.6508176Z def test_silu_mul_quant( 2025-05-07T20:32:43.6508436Z self, 2025-05-07T20:32:43.6508632Z T: int, 2025-05-07T20:32:43.6508840Z D: int, 2025-05-07T20:32:43.6509067Z scale_ub: Optional[float], 2025-05-07T20:32:43.6509343Z contiguous: bool, 2025-05-07T20:32:43.6509847Z compiled: bool, 2025-05-07T20:32:43.6510085Z ) -> None: 2025-05-07T20:32:43.6510305Z torch.manual_seed(2025) 2025-05-07T20:32:43.6510557Z 2025-05-07T20:32:43.6510843Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.6511186Z 2025-05-07T20:32:43.6511389Z x_sign = torch.sign(x) 2025-05-07T20:32:43.6511689Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.6512006Z x = x_sign * x_clamp 2025-05-07T20:32:43.6512261Z x0 = x[:, :D] 2025-05-07T20:32:43.6512482Z x1 = x[:, D:] 2025-05-07T20:32:43.6512698Z 2025-05-07T20:32:43.6512886Z if contiguous: 2025-05-07T20:32:43.6513126Z x0 = x0.contiguous() 2025-05-07T20:32:43.6513390Z x1 = x1.contiguous() 2025-05-07T20:32:43.6513628Z 2025-05-07T20:32:43.6513826Z if scale_ub is not None: 2025-05-07T20:32:43.6514106Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.6514447Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.6514762Z ) 2025-05-07T20:32:43.6514965Z else: 2025-05-07T20:32:43.6515181Z scale_ub_tensor = None 2025-05-07T20:32:43.6515445Z 2025-05-07T20:32:43.6515687Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.6516000Z op = silu_mul_quant 2025-05-07T20:32:43.6516259Z if compiled: 2025-05-07T20:32:43.6516514Z op = torch.compile(op) 2025-05-07T20:32:43.6516901Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.6517178Z 2025-05-07T20:32:43.6517379Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.6517544Z 2025-05-07T20:32:43.6517652Z moe/activation_test.py:117: 2025-05-07T20:32:43.6517948Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.6518289Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.6518579Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.6519279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:43.6519979Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.6520523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.6521217Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.6521962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.6522503Z kernel = self.compile( 2025-05-07T20:32:43.6523050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.6523711Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.6524106Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.6524411Z 2025-05-07T20:32:43.6524619Z self = 2025-05-07T20:32:43.6525702Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.6527119Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e3f8cc0>} 2025-05-07T20:32:43.6528558Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.6529593Z context = 2025-05-07T20:32:43.6529890Z 2025-05-07T20:32:43.6530139Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.6530673Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.6531145Z module_map=module_map) 2025-05-07T20:32:43.6531523Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.6531889Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.6532153Z E ^ 2025-05-07T20:32:43.6532636Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.6533096Z 2025-05-07T20:32:43.6533514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.6534029Z 2025-05-07T20:32:43.6534142Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.6534617Z self=, 2025-05-07T20:32:43.6535032Z T=1, 2025-05-07T20:32:43.6535230Z D=7168, 2025-05-07T20:32:43.6535437Z scale_ub=1200.0, 2025-05-07T20:32:43.6535667Z contiguous=False, 2025-05-07T20:32:43.6535907Z compiled=False, 2025-05-07T20:32:43.6536128Z ) 2025-05-07T20:32:43.6536450Z self = 2025-05-07T20:32:43.6536950Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.6537229Z 2025-05-07T20:32:43.6537309Z @given( 2025-05-07T20:32:43.6537602Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.6537923Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.6538243Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.6538587Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.6538919Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.6539217Z ) 2025-05-07T20:32:43.6539582Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.6542619Z def test_silu_mul_quant( 2025-05-07T20:32:43.6542870Z self, 2025-05-07T20:32:43.6543076Z T: int, 2025-05-07T20:32:43.6543274Z D: int, 2025-05-07T20:32:43.6543500Z scale_ub: Optional[float], 2025-05-07T20:32:43.6543772Z contiguous: bool, 2025-05-07T20:32:43.6544008Z compiled: bool, 2025-05-07T20:32:43.6544235Z ) -> None: 2025-05-07T20:32:43.6544453Z torch.manual_seed(2025) 2025-05-07T20:32:43.6544707Z 2025-05-07T20:32:43.6545041Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.6545384Z 2025-05-07T20:32:43.6545575Z x_sign = torch.sign(x) 2025-05-07T20:32:43.6545868Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.6546176Z x = x_sign * x_clamp 2025-05-07T20:32:43.6546415Z x0 = x[:, :D] 2025-05-07T20:32:43.6546635Z x1 = x[:, D:] 2025-05-07T20:32:43.6546849Z 2025-05-07T20:32:43.6547042Z if contiguous: 2025-05-07T20:32:43.6547298Z x0 = x0.contiguous() 2025-05-07T20:32:43.6547558Z x1 = x1.contiguous() 2025-05-07T20:32:43.6547808Z 2025-05-07T20:32:43.6547998Z if scale_ub is not None: 2025-05-07T20:32:43.6548273Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.6548615Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.6548923Z ) 2025-05-07T20:32:43.6549125Z else: 2025-05-07T20:32:43.6549350Z scale_ub_tensor = None 2025-05-07T20:32:43.6549604Z 2025-05-07T20:32:43.6549831Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.6550147Z op = silu_mul_quant 2025-05-07T20:32:43.6550395Z if compiled: 2025-05-07T20:32:43.6550638Z op = torch.compile(op) 2025-05-07T20:32:43.6550930Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.6551208Z 2025-05-07T20:32:43.6551449Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.6551622Z 2025-05-07T20:32:43.6551723Z moe/activation_test.py:117: 2025-05-07T20:32:43.6552022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.6552357Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.6552642Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.6553336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.6554037Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.6554571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.6555257Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.6555921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.6556450Z kernel = self.compile( 2025-05-07T20:32:43.6557003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.6557670Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.6558102Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.6558351Z 2025-05-07T20:32:43.6558557Z self = 2025-05-07T20:32:43.6559691Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.6561062Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e3f9080>} 2025-05-07T20:32:43.6562409Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.6563515Z context = 2025-05-07T20:32:43.6563803Z 2025-05-07T20:32:43.6563967Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.6564497Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.6565007Z module_map=module_map) 2025-05-07T20:32:43.6565366Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.6565728Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.6565995Z E ^ 2025-05-07T20:32:43.6566464Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.6566910Z 2025-05-07T20:32:43.6567325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.6567920Z 2025-05-07T20:32:43.7903306Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.7903752Z self=, 2025-05-07T20:32:43.7904180Z T=4096, 2025-05-07T20:32:43.7904377Z D=7168, 2025-05-07T20:32:43.7904584Z scale_ub=1200.0, 2025-05-07T20:32:43.7904811Z contiguous=False, 2025-05-07T20:32:43.7905044Z compiled=True, 2025-05-07T20:32:43.7905251Z ) 2025-05-07T20:32:43.7905574Z self = 2025-05-07T20:32:43.7906238Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.7906511Z 2025-05-07T20:32:43.7906592Z @given( 2025-05-07T20:32:43.7906831Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.7907149Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.7907573Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.7907912Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.7908245Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.7908530Z ) 2025-05-07T20:32:43.7908886Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.7909331Z def test_silu_mul_quant( 2025-05-07T20:32:43.7909577Z self, 2025-05-07T20:32:43.7909769Z T: int, 2025-05-07T20:32:43.7909977Z D: int, 2025-05-07T20:32:43.7910199Z scale_ub: Optional[float], 2025-05-07T20:32:43.7910468Z contiguous: bool, 2025-05-07T20:32:43.7910710Z compiled: bool, 2025-05-07T20:32:43.7910937Z ) -> None: 2025-05-07T20:32:43.7911155Z torch.manual_seed(2025) 2025-05-07T20:32:43.7911404Z 2025-05-07T20:32:43.7911684Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.7912030Z 2025-05-07T20:32:43.7912231Z x_sign = torch.sign(x) 2025-05-07T20:32:43.7912531Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.7912838Z x = x_sign * x_clamp 2025-05-07T20:32:43.7913084Z x0 = x[:, :D] 2025-05-07T20:32:43.7913310Z x1 = x[:, D:] 2025-05-07T20:32:43.7913516Z 2025-05-07T20:32:43.7913715Z if contiguous: 2025-05-07T20:32:43.7913950Z x0 = x0.contiguous() 2025-05-07T20:32:43.7914214Z x1 = x1.contiguous() 2025-05-07T20:32:43.7914526Z 2025-05-07T20:32:43.7914729Z if scale_ub is not None: 2025-05-07T20:32:43.7915011Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.7915346Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.7915660Z ) 2025-05-07T20:32:43.7915858Z else: 2025-05-07T20:32:43.7916086Z scale_ub_tensor = None 2025-05-07T20:32:43.7916353Z 2025-05-07T20:32:43.7916591Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.7916916Z op = silu_mul_quant 2025-05-07T20:32:43.7917258Z if compiled: 2025-05-07T20:32:43.7925419Z op = torch.compile(op) 2025-05-07T20:32:43.7925769Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.7926052Z 2025-05-07T20:32:43.7926258Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.7926433Z 2025-05-07T20:32:43.7926540Z moe/activation_test.py:117: 2025-05-07T20:32:43.7926859Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.7927310Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.7927713Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.7928290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.7928871Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.7929541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.7930235Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.7930779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.7931473Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.7932151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.7932694Z kernel = self.compile( 2025-05-07T20:32:43.7933258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.7933927Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.7934333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.7934574Z 2025-05-07T20:32:43.7934786Z self = 2025-05-07T20:32:43.7935934Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.7937322Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e3fb060>} 2025-05-07T20:32:43.7938682Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.7939714Z context = 2025-05-07T20:32:43.7940018Z 2025-05-07T20:32:43.7940190Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.7940719Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.7941210Z module_map=module_map) 2025-05-07T20:32:43.7941581Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.7941952Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.7942223Z E ^ 2025-05-07T20:32:43.7942696Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.7943145Z 2025-05-07T20:32:43.7943611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.7944135Z 2025-05-07T20:32:43.7944246Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.7944671Z self=, 2025-05-07T20:32:43.7945086Z T=128, 2025-05-07T20:32:43.7945282Z D=7168, 2025-05-07T20:32:43.7945490Z scale_ub=1200.0, 2025-05-07T20:32:43.7945733Z contiguous=False, 2025-05-07T20:32:43.7945966Z compiled=True, 2025-05-07T20:32:43.7946235Z ) 2025-05-07T20:32:43.8655529Z self = 2025-05-07T20:32:43.8656038Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.8656327Z 2025-05-07T20:32:43.8656418Z @given( 2025-05-07T20:32:43.8656664Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.8657027Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.8657453Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.8657989Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.8658643Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.8659215Z ) 2025-05-07T20:32:43.8659916Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.8660794Z def test_silu_mul_quant( 2025-05-07T20:32:43.8661288Z self, 2025-05-07T20:32:43.8661680Z T: int, 2025-05-07T20:32:43.8662069Z D: int, 2025-05-07T20:32:43.8662512Z scale_ub: Optional[float], 2025-05-07T20:32:43.8663054Z contiguous: bool, 2025-05-07T20:32:43.8663529Z compiled: bool, 2025-05-07T20:32:43.8663983Z ) -> None: 2025-05-07T20:32:43.8664419Z torch.manual_seed(2025) 2025-05-07T20:32:43.8664898Z 2025-05-07T20:32:43.8665450Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.8666146Z 2025-05-07T20:32:43.8666530Z x_sign = torch.sign(x) 2025-05-07T20:32:43.8667122Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.8667730Z x = x_sign * x_clamp 2025-05-07T20:32:43.8667999Z x0 = x[:, :D] 2025-05-07T20:32:43.8668238Z x1 = x[:, D:] 2025-05-07T20:32:43.8668454Z 2025-05-07T20:32:43.8668648Z if contiguous: 2025-05-07T20:32:43.8668886Z x0 = x0.contiguous() 2025-05-07T20:32:43.8669226Z x1 = x1.contiguous() 2025-05-07T20:32:43.8669473Z 2025-05-07T20:32:43.8669668Z if scale_ub is not None: 2025-05-07T20:32:43.8669943Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.8670286Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.8670596Z ) 2025-05-07T20:32:43.8670801Z else: 2025-05-07T20:32:43.8671024Z scale_ub_tensor = None 2025-05-07T20:32:43.8671281Z 2025-05-07T20:32:43.8671524Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.8671854Z op = silu_mul_quant 2025-05-07T20:32:43.8672105Z if compiled: 2025-05-07T20:32:43.8672365Z op = torch.compile(op) 2025-05-07T20:32:43.8672669Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8672959Z 2025-05-07T20:32:43.8673154Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.8673330Z 2025-05-07T20:32:43.8673435Z moe/activation_test.py:117: 2025-05-07T20:32:43.8673746Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8674080Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.8674371Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8674943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.8675503Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.8676240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.8676937Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.8677484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.8678166Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.8678836Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.8679382Z kernel = self.compile( 2025-05-07T20:32:43.8679995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.8680648Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.8681050Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8681284Z 2025-05-07T20:32:43.8681502Z self = 2025-05-07T20:32:43.8682623Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.8683996Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e0c0360>} 2025-05-07T20:32:43.8685340Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.8686366Z context = 2025-05-07T20:32:43.8686652Z 2025-05-07T20:32:43.8686825Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.8687346Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.8687928Z module_map=module_map) 2025-05-07T20:32:43.8688342Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.8688704Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.8688971Z E ^ 2025-05-07T20:32:43.8689438Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.8689938Z 2025-05-07T20:32:43.8690362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.8690872Z 2025-05-07T20:32:43.8690977Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.8691396Z self=, 2025-05-07T20:32:43.8691807Z T=2048, 2025-05-07T20:32:43.8692008Z D=7168, 2025-05-07T20:32:43.8692202Z scale_ub=None, 2025-05-07T20:32:43.8692430Z contiguous=True, 2025-05-07T20:32:43.8692660Z compiled=True, 2025-05-07T20:32:43.8692860Z ) 2025-05-07T20:32:43.8693195Z self = 2025-05-07T20:32:43.8693697Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.8693966Z 2025-05-07T20:32:43.8694047Z @given( 2025-05-07T20:32:43.8694290Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.8694616Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.8694928Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.8695265Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.8695601Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.8695897Z ) 2025-05-07T20:32:43.8696254Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.8696697Z def test_silu_mul_quant( 2025-05-07T20:32:43.8696995Z self, 2025-05-07T20:32:43.8697207Z T: int, 2025-05-07T20:32:43.8697405Z D: int, 2025-05-07T20:32:43.8697631Z scale_ub: Optional[float], 2025-05-07T20:32:43.8697908Z contiguous: bool, 2025-05-07T20:32:43.8698148Z compiled: bool, 2025-05-07T20:32:43.8698391Z ) -> None: 2025-05-07T20:32:43.8698642Z torch.manual_seed(2025) 2025-05-07T20:32:43.8698883Z 2025-05-07T20:32:43.8699161Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.8699511Z 2025-05-07T20:32:43.8699755Z x_sign = torch.sign(x) 2025-05-07T20:32:43.8700052Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.8700364Z x = x_sign * x_clamp 2025-05-07T20:32:43.8700602Z x0 = x[:, :D] 2025-05-07T20:32:43.8700825Z x1 = x[:, D:] 2025-05-07T20:32:43.8701043Z 2025-05-07T20:32:43.8701232Z if contiguous: 2025-05-07T20:32:43.8701473Z x0 = x0.contiguous() 2025-05-07T20:32:43.8701781Z x1 = x1.contiguous() 2025-05-07T20:32:43.8702026Z 2025-05-07T20:32:43.8702218Z if scale_ub is not None: 2025-05-07T20:32:43.8702497Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.8702836Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.8703144Z ) 2025-05-07T20:32:43.8703341Z else: 2025-05-07T20:32:43.8703558Z scale_ub_tensor = None 2025-05-07T20:32:43.8703813Z 2025-05-07T20:32:43.8704048Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.8704369Z op = silu_mul_quant 2025-05-07T20:32:43.8704616Z if compiled: 2025-05-07T20:32:43.8704873Z op = torch.compile(op) 2025-05-07T20:32:43.8705178Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8705453Z 2025-05-07T20:32:43.8705820Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.8705987Z 2025-05-07T20:32:43.8706099Z moe/activation_test.py:117: 2025-05-07T20:32:43.8706408Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8706738Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.8707026Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8707588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.8708145Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.8708813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.8709586Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.8710129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.8710810Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.8711477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.8712021Z kernel = self.compile( 2025-05-07T20:32:43.8712558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.8713214Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.8713613Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8713837Z 2025-05-07T20:32:43.8714053Z self = 2025-05-07T20:32:43.8715123Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.8716577Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e0c0ea0>} 2025-05-07T20:32:43.8717918Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.8718942Z context = 2025-05-07T20:32:43.8719227Z 2025-05-07T20:32:43.8719396Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.8719916Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.8720444Z module_map=module_map) 2025-05-07T20:32:43.8720805Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.8721155Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.8721417Z E ^ 2025-05-07T20:32:43.8721890Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.8722396Z 2025-05-07T20:32:43.8722817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.8723324Z 2025-05-07T20:32:43.9372203Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9372674Z self=, 2025-05-07T20:32:43.9373093Z T=16384, 2025-05-07T20:32:43.9373294Z D=5120, 2025-05-07T20:32:43.9373498Z scale_ub=None, 2025-05-07T20:32:43.9373719Z contiguous=False, 2025-05-07T20:32:43.9373952Z compiled=False, 2025-05-07T20:32:43.9374158Z ) 2025-05-07T20:32:43.9374480Z self = 2025-05-07T20:32:43.9374981Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.9375259Z 2025-05-07T20:32:43.9375341Z @given( 2025-05-07T20:32:43.9375578Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.9375899Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.9376208Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.9376536Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.9376865Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.9377156Z ) 2025-05-07T20:32:43.9377502Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.9378048Z def test_silu_mul_quant( 2025-05-07T20:32:43.9378298Z self, 2025-05-07T20:32:43.9378518Z T: int, 2025-05-07T20:32:43.9378745Z D: int, 2025-05-07T20:32:43.9378967Z scale_ub: Optional[float], 2025-05-07T20:32:43.9379232Z contiguous: bool, 2025-05-07T20:32:43.9379475Z compiled: bool, 2025-05-07T20:32:43.9379701Z ) -> None: 2025-05-07T20:32:43.9379911Z torch.manual_seed(2025) 2025-05-07T20:32:43.9380156Z 2025-05-07T20:32:43.9380435Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.9380782Z 2025-05-07T20:32:43.9380974Z x_sign = torch.sign(x) 2025-05-07T20:32:43.9381269Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.9383301Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
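The repeated CompilationError above is Triton rejecting the fp8e4nv dtype (PyTorch's float8_e4m3fn) on this GPU: fp8e4nv kernels are only generated for compute capability 8.9 or newer (Ada/Hopper), while the A10G on a g5.4xlarge runner reports 8.6, where Triton offers only fp8e4b15 and fp8e5. A minimal sketch of a capability guard that a suite could use to skip these cases on unsupported hardware; the helper name and the skip wiring are assumptions for illustration, not FBGEMM's actual code:

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (float8_e4m3fn) math requires SM 8.9+; the A10G in this
    # log reports SM 8.6, so Triton raises at kernel compile time.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
class ActivationFp8Tests(unittest.TestCase):
    ...  # fp8 test cases would go here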
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.9385183Z 2025-05-07T20:32:43.9385307Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.9385518Z 2025-05-07T20:32:43.9385700Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9386115Z self=, 2025-05-07T20:32:43.9386521Z T=4096, 2025-05-07T20:32:43.9386713Z D=7168, 2025-05-07T20:32:43.9386927Z scale_ub=1200.0, 2025-05-07T20:32:43.9387155Z contiguous=True, 2025-05-07T20:32:43.9387379Z compiled=True, 2025-05-07T20:32:43.9387583Z ) 2025-05-07T20:32:43.9387900Z self = 2025-05-07T20:32:43.9388397Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.9388742Z 2025-05-07T20:32:43.9388828Z @given( 2025-05-07T20:32:43.9389054Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.9389367Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.9389675Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.9390002Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.9390336Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.9390689Z ) 2025-05-07T20:32:43.9391042Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.9391480Z def test_silu_mul_quant( 2025-05-07T20:32:43.9391721Z self, 2025-05-07T20:32:43.9391921Z T: int, 2025-05-07T20:32:43.9392115Z D: int, 2025-05-07T20:32:43.9392336Z scale_ub: Optional[float], 2025-05-07T20:32:43.9392611Z contiguous: bool, 2025-05-07T20:32:43.9392849Z compiled: bool, 2025-05-07T20:32:43.9393080Z ) -> None: 2025-05-07T20:32:43.9393300Z torch.manual_seed(2025) 2025-05-07T20:32:43.9393543Z 2025-05-07T20:32:43.9393818Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.9394166Z 2025-05-07T20:32:43.9394357Z x_sign = torch.sign(x) 2025-05-07T20:32:43.9394653Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.9396672Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.9398600Z 2025-05-07T20:32:43.9398720Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.9398931Z 2025-05-07T20:32:43.9399039Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9399444Z self=, 2025-05-07T20:32:43.9399848Z T=16384, 2025-05-07T20:32:43.9400046Z D=7168, 2025-05-07T20:32:43.9400241Z scale_ub=None, 2025-05-07T20:32:43.9400462Z contiguous=False, 2025-05-07T20:32:43.9400695Z compiled=False, 2025-05-07T20:32:43.9400897Z ) 2025-05-07T20:32:43.9401213Z self = 2025-05-07T20:32:43.9401709Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.9401988Z 2025-05-07T20:32:43.9402069Z @given( 2025-05-07T20:32:43.9402295Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.9402611Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.9402918Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.9403253Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.9403578Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.9403866Z ) 2025-05-07T20:32:43.9404216Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.9404652Z def test_silu_mul_quant( 2025-05-07T20:32:43.9404896Z self, 2025-05-07T20:32:43.9405195Z T: int, 2025-05-07T20:32:43.9405392Z D: int, 2025-05-07T20:32:43.9405766Z scale_ub: Optional[float], 2025-05-07T20:32:43.9406043Z contiguous: bool, 2025-05-07T20:32:43.9406278Z compiled: bool, 2025-05-07T20:32:43.9406501Z ) -> None: 2025-05-07T20:32:43.9406714Z torch.manual_seed(2025) 2025-05-07T20:32:43.9406954Z 2025-05-07T20:32:43.9407224Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.9409432Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
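The 448.00 MiB request above matches the tensor being allocated exactly: x = torch.randn([T, 2 * D]) with T=16384, D=7168 in bfloat16 is 16384 x 14336 elements at 2 bytes each, i.e. 469,762,048 bytes = 448 MiB, well beyond the 140.44 MiB the allocator has left. A quick sanity check of that arithmetic:

>>> T, D = 16384, 7168
>>> T * (2 * D) * 2 / 2**20  # elements times 2 bytes per bfloat16 value, in MiB
448.0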
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.9411364Z 2025-05-07T20:32:43.9411482Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.9411693Z 2025-05-07T20:32:43.9411801Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9412207Z self=, 2025-05-07T20:32:43.9412613Z T=2048, 2025-05-07T20:32:43.9412806Z D=7168, 2025-05-07T20:32:43.9413000Z scale_ub=1200.0, 2025-05-07T20:32:43.9413222Z contiguous=True, 2025-05-07T20:32:43.9413448Z compiled=True, 2025-05-07T20:32:43.9413659Z ) 2025-05-07T20:32:43.9413978Z self = 2025-05-07T20:32:43.9414476Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.9414744Z 2025-05-07T20:32:43.9414833Z @given( 2025-05-07T20:32:43.9415057Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.9415380Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.9415684Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.9416011Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.9416343Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.9416634Z ) 2025-05-07T20:32:43.9416983Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.9417420Z def test_silu_mul_quant( 2025-05-07T20:32:43.9417734Z self, 2025-05-07T20:32:43.9417932Z T: int, 2025-05-07T20:32:43.9418129Z D: int, 2025-05-07T20:32:43.9418358Z scale_ub: Optional[float], 2025-05-07T20:32:43.9418633Z contiguous: bool, 2025-05-07T20:32:43.9418870Z compiled: bool, 2025-05-07T20:32:43.9419095Z ) -> None: 2025-05-07T20:32:43.9419310Z torch.manual_seed(2025) 2025-05-07T20:32:43.9419549Z 2025-05-07T20:32:43.9419823Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.9420169Z 2025-05-07T20:32:43.9420361Z x_sign = torch.sign(x) 2025-05-07T20:32:43.9420653Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.9422638Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.9424490Z 2025-05-07T20:32:43.9424609Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.9424819Z 2025-05-07T20:32:43.9424926Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9425399Z self=, 2025-05-07T20:32:43.9425804Z T=2048, 2025-05-07T20:32:43.9425994Z D=7168, 2025-05-07T20:32:43.9426181Z scale_ub=None, 2025-05-07T20:32:43.9426397Z contiguous=True, 2025-05-07T20:32:43.9426621Z compiled=False, 2025-05-07T20:32:43.9426823Z ) 2025-05-07T20:32:44.0290042Z self = 2025-05-07T20:32:44.0290594Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.0290966Z 2025-05-07T20:32:44.0291055Z @given( 2025-05-07T20:32:44.0291280Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.0291592Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.0291894Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.0292219Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.0292545Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.0292839Z ) 2025-05-07T20:32:44.0293251Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.0293690Z def test_silu_mul_quant( 2025-05-07T20:32:44.0293933Z self, 2025-05-07T20:32:44.0294131Z T: int, 2025-05-07T20:32:44.0294323Z D: int, 2025-05-07T20:32:44.0294542Z scale_ub: Optional[float], 2025-05-07T20:32:44.0294818Z contiguous: bool, 2025-05-07T20:32:44.0295055Z compiled: bool, 2025-05-07T20:32:44.0295282Z ) -> None: 2025-05-07T20:32:44.0295504Z torch.manual_seed(2025) 2025-05-07T20:32:44.0295740Z 2025-05-07T20:32:44.0296012Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.0296360Z 2025-05-07T20:32:44.0296551Z > x_sign = torch.sign(x) 2025-05-07T20:32:44.0298528Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
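Every OOM record above ends with the allocator's standard hint about PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. The caching allocator reads that variable once, when CUDA is first initialized, so it must be in the environment before the first allocation, for example exported in the CI job step or set at the top of the test entrypoint. A minimal sketch, assuming a conftest.py that runs before anything touches CUDA (the placement is an assumption about how one might wire it into this suite):

# conftest.py (hypothetical placement; must execute before CUDA is initialized)
import os

# Let the allocator grow existing segments instead of carving fixed-size
# blocks, which is the fragmentation mitigation the OOM messages suggest.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # noqa: E402  (imported only after the env var is set)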
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.0300379Z 2025-05-07T20:32:44.0307045Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:44.0307443Z 2025-05-07T20:32:44.0307557Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.0307980Z self=, 2025-05-07T20:32:44.0308391Z T=1, 2025-05-07T20:32:44.0308578Z D=7168, 2025-05-07T20:32:44.0308781Z scale_ub=1200.0, 2025-05-07T20:32:44.0309015Z contiguous=True, 2025-05-07T20:32:44.0309241Z compiled=False, 2025-05-07T20:32:44.0309453Z ) 2025-05-07T20:32:44.0309785Z self = 2025-05-07T20:32:44.0310286Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.0310557Z 2025-05-07T20:32:44.0310638Z @given( 2025-05-07T20:32:44.0310877Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.0311195Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.0311502Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.0311845Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.0312180Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.0312477Z ) 2025-05-07T20:32:44.0312837Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.0313286Z def test_silu_mul_quant( 2025-05-07T20:32:44.0313530Z self, 2025-05-07T20:32:44.0313730Z T: int, 2025-05-07T20:32:44.0313937Z D: int, 2025-05-07T20:32:44.0314231Z scale_ub: Optional[float], 2025-05-07T20:32:44.0314507Z contiguous: bool, 2025-05-07T20:32:44.0314757Z compiled: bool, 2025-05-07T20:32:44.0314993Z ) -> None: 2025-05-07T20:32:44.0315211Z torch.manual_seed(2025) 2025-05-07T20:32:44.0315457Z 2025-05-07T20:32:44.0315736Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.0316078Z 2025-05-07T20:32:44.0316277Z x_sign = torch.sign(x) 2025-05-07T20:32:44.0316580Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.0316967Z x = x_sign * x_clamp 2025-05-07T20:32:44.0317209Z x0 = x[:, :D] 2025-05-07T20:32:44.0317432Z x1 = x[:, D:] 2025-05-07T20:32:44.0317647Z 2025-05-07T20:32:44.0317837Z if contiguous: 2025-05-07T20:32:44.0318082Z x0 = x0.contiguous() 2025-05-07T20:32:44.0318346Z x1 = x1.contiguous() 2025-05-07T20:32:44.0318586Z 2025-05-07T20:32:44.0318791Z if scale_ub is not None: 2025-05-07T20:32:44.0319137Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.0319483Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.0319799Z ) 2025-05-07T20:32:44.0320006Z else: 2025-05-07T20:32:44.0320227Z scale_ub_tensor = None 2025-05-07T20:32:44.0320480Z 2025-05-07T20:32:44.0320718Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.0321042Z op = silu_mul_quant 2025-05-07T20:32:44.0321298Z if compiled: 2025-05-07T20:32:44.0321559Z op = torch.compile(op) 2025-05-07T20:32:44.0321861Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.0322137Z 2025-05-07T20:32:44.0322335Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.0322501Z 2025-05-07T20:32:44.0322606Z moe/activation_test.py:117: 2025-05-07T20:32:44.0322904Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.0323243Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.0323533Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.0324236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.0324929Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.0325469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.0326155Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.0326868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.0327405Z kernel = self.compile( 2025-05-07T20:32:44.0328037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.0328741Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.0329136Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.0329366Z 2025-05-07T20:32:44.0329572Z self = 2025-05-07T20:32:44.0330645Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.0332019Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e230680>} 2025-05-07T20:32:44.0333352Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.0334420Z context = 2025-05-07T20:32:44.0334712Z 2025-05-07T20:32:44.0334882Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.0335403Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.0335862Z module_map=module_map) 2025-05-07T20:32:44.0336230Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.0336581Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.0336846Z E ^ 2025-05-07T20:32:44.0337353Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.0337807Z 2025-05-07T20:32:44.0338221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.0338763Z 2025-05-07T20:32:44.0338887Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.0339345Z self=, 2025-05-07T20:32:44.0339749Z T=128, 2025-05-07T20:32:44.0339945Z D=5120, 2025-05-07T20:32:44.0340144Z scale_ub=None, 2025-05-07T20:32:44.0340358Z contiguous=True, 2025-05-07T20:32:44.0340585Z compiled=False, 2025-05-07T20:32:44.0340791Z ) 2025-05-07T20:32:44.2555644Z self = 2025-05-07T20:32:44.2556645Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.2557167Z 2025-05-07T20:32:44.2557317Z @given( 2025-05-07T20:32:44.2557742Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2558308Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2558751Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2559083Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2559417Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2559701Z ) 2025-05-07T20:32:44.2560059Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2560503Z def test_silu_mul_quant( 2025-05-07T20:32:44.2560746Z self, 2025-05-07T20:32:44.2560943Z T: int, 2025-05-07T20:32:44.2561143Z D: int, 2025-05-07T20:32:44.2561358Z scale_ub: Optional[float], 2025-05-07T20:32:44.2561633Z contiguous: bool, 2025-05-07T20:32:44.2561873Z compiled: bool, 2025-05-07T20:32:44.2562221Z ) -> None: 2025-05-07T20:32:44.2562440Z torch.manual_seed(2025) 2025-05-07T20:32:44.2562685Z 2025-05-07T20:32:44.2562956Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2563302Z 2025-05-07T20:32:44.2563500Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2563792Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2564094Z x = x_sign * x_clamp 2025-05-07T20:32:44.2564334Z x0 = x[:, :D] 2025-05-07T20:32:44.2564558Z x1 = x[:, D:] 2025-05-07T20:32:44.2564768Z 2025-05-07T20:32:44.2564956Z if contiguous: 2025-05-07T20:32:44.2565188Z x0 = x0.contiguous() 2025-05-07T20:32:44.2565440Z x1 = x1.contiguous() 2025-05-07T20:32:44.2565689Z 2025-05-07T20:32:44.2565880Z if scale_ub is not None: 2025-05-07T20:32:44.2566148Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2566486Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2566797Z ) 2025-05-07T20:32:44.2566991Z else: 2025-05-07T20:32:44.2567211Z scale_ub_tensor = None 2025-05-07T20:32:44.2567465Z 2025-05-07T20:32:44.2567772Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2568089Z op = silu_mul_quant 2025-05-07T20:32:44.2568341Z if compiled: 2025-05-07T20:32:44.2568591Z op = torch.compile(op) 2025-05-07T20:32:44.2568885Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2569235Z 2025-05-07T20:32:44.2569433Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.2569600Z 2025-05-07T20:32:44.2569702Z moe/activation_test.py:117: 2025-05-07T20:32:44.2569996Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2570329Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.2570610Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2571294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.2572066Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.2572601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2573281Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2573938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2574528Z kernel = self.compile( 2025-05-07T20:32:44.2575066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2575714Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2576113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2576342Z 2025-05-07T20:32:44.2576550Z self = 2025-05-07T20:32:44.2577628Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2578995Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e2318a0>} 2025-05-07T20:32:44.2580325Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2581352Z context = 2025-05-07T20:32:44.2581635Z 2025-05-07T20:32:44.2581811Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2582379Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2582840Z module_map=module_map) 2025-05-07T20:32:44.2583208Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2583562Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.2583819Z E ^ 2025-05-07T20:32:44.2584281Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2584731Z 2025-05-07T20:32:44.2585147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2585657Z 2025-05-07T20:32:44.2585766Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2586177Z self=, 2025-05-07T20:32:44.2586578Z T=128, 2025-05-07T20:32:44.2586766Z D=7168, 2025-05-07T20:32:44.2586963Z scale_ub=None, 2025-05-07T20:32:44.2587175Z contiguous=True, 2025-05-07T20:32:44.2587397Z compiled=False, 2025-05-07T20:32:44.2587597Z ) 2025-05-07T20:32:44.2587935Z self = 2025-05-07T20:32:44.2588453Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.2588722Z 2025-05-07T20:32:44.2588809Z @given( 2025-05-07T20:32:44.2589038Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2589402Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2589711Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2590033Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2590361Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2590642Z ) 2025-05-07T20:32:44.2590987Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2591421Z def test_silu_mul_quant( 2025-05-07T20:32:44.2591689Z self, 2025-05-07T20:32:44.2591935Z T: int, 2025-05-07T20:32:44.2592130Z D: int, 2025-05-07T20:32:44.2592351Z scale_ub: Optional[float], 2025-05-07T20:32:44.2592619Z contiguous: bool, 2025-05-07T20:32:44.2592856Z compiled: bool, 2025-05-07T20:32:44.2593076Z ) -> None: 2025-05-07T20:32:44.2593288Z torch.manual_seed(2025) 2025-05-07T20:32:44.2593529Z 2025-05-07T20:32:44.2593796Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2594208Z 2025-05-07T20:32:44.2594404Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2594693Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2595003Z x = x_sign * x_clamp 2025-05-07T20:32:44.2595240Z x0 = x[:, :D] 2025-05-07T20:32:44.2595455Z x1 = x[:, D:] 2025-05-07T20:32:44.2595663Z 2025-05-07T20:32:44.2595849Z if contiguous: 2025-05-07T20:32:44.2596074Z x0 = x0.contiguous() 2025-05-07T20:32:44.2596339Z x1 = x1.contiguous() 2025-05-07T20:32:44.2596583Z 2025-05-07T20:32:44.2596772Z if scale_ub is not None: 2025-05-07T20:32:44.2597043Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2597375Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2597682Z ) 2025-05-07T20:32:44.2597875Z else: 2025-05-07T20:32:44.2598085Z scale_ub_tensor = None 2025-05-07T20:32:44.2598336Z 2025-05-07T20:32:44.2598570Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2598891Z op = silu_mul_quant 2025-05-07T20:32:44.2599150Z if compiled: 2025-05-07T20:32:44.2599397Z op = torch.compile(op) 2025-05-07T20:32:44.2599690Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2599965Z 2025-05-07T20:32:44.2600155Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.2600321Z 2025-05-07T20:32:44.2600474Z moe/activation_test.py:117: 2025-05-07T20:32:44.2600775Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2601107Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.2601387Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2602074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.2602764Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.2603305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2603986Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2604650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2605180Z kernel = self.compile( 2025-05-07T20:32:44.2605899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2606562Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2606957Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2607187Z 2025-05-07T20:32:44.2607391Z self = 2025-05-07T20:32:44.2608662Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2610039Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e2327a0>} 2025-05-07T20:32:44.2611391Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2612479Z context = 2025-05-07T20:32:44.2612765Z 2025-05-07T20:32:44.2612929Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2613452Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2613914Z module_map=module_map) 2025-05-07T20:32:44.2614335Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2614692Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.2614954Z E ^ 2025-05-07T20:32:44.2615419Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2615870Z 2025-05-07T20:32:44.2616281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2616793Z 2025-05-07T20:32:44.2616900Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2617311Z self=, 2025-05-07T20:32:44.2617710Z T=2048, 2025-05-07T20:32:44.2617894Z D=7168, 2025-05-07T20:32:44.2618087Z scale_ub=1200.0, 2025-05-07T20:32:44.2618307Z contiguous=True, 2025-05-07T20:32:44.2618523Z compiled=False, 2025-05-07T20:32:44.2618726Z ) 2025-05-07T20:32:44.3287835Z self = 2025-05-07T20:32:44.3288647Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.3288967Z 2025-05-07T20:32:44.3289046Z @given( 2025-05-07T20:32:44.3289276Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3289583Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3289885Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3290305Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3290635Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3290918Z ) 2025-05-07T20:32:44.3291261Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3291698Z def test_silu_mul_quant( 2025-05-07T20:32:44.3291940Z self, 2025-05-07T20:32:44.3292128Z T: int, 2025-05-07T20:32:44.3292327Z D: int, 2025-05-07T20:32:44.3292545Z scale_ub: Optional[float], 2025-05-07T20:32:44.3292811Z contiguous: bool, 2025-05-07T20:32:44.3293053Z compiled: bool, 2025-05-07T20:32:44.3293273Z ) -> None: 2025-05-07T20:32:44.3293486Z torch.manual_seed(2025) 2025-05-07T20:32:44.3293731Z 2025-05-07T20:32:44.3294001Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3296048Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3297905Z 2025-05-07T20:32:44.3298098Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3298324Z 2025-05-07T20:32:44.3298438Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3298893Z self=, 2025-05-07T20:32:44.3299465Z T=1, 2025-05-07T20:32:44.3299695Z D=5120, 2025-05-07T20:32:44.3299946Z scale_ub=1200.0, 2025-05-07T20:32:44.3300229Z contiguous=True, 2025-05-07T20:32:44.3300507Z compiled=False, 2025-05-07T20:32:44.3300754Z ) 2025-05-07T20:32:44.3301078Z self = 2025-05-07T20:32:44.3301656Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.3301925Z 2025-05-07T20:32:44.3302005Z @given( 2025-05-07T20:32:44.3302240Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3302554Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3302877Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3303264Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3303595Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3303876Z ) 2025-05-07T20:32:44.3304217Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3304656Z def test_silu_mul_quant( 2025-05-07T20:32:44.3304895Z self, 2025-05-07T20:32:44.3305088Z T: int, 2025-05-07T20:32:44.3305289Z D: int, 2025-05-07T20:32:44.3305504Z scale_ub: Optional[float], 2025-05-07T20:32:44.3306009Z contiguous: bool, 2025-05-07T20:32:44.3306252Z compiled: bool, 2025-05-07T20:32:44.3306479Z ) -> None: 2025-05-07T20:32:44.3306684Z torch.manual_seed(2025) 2025-05-07T20:32:44.3306925Z 2025-05-07T20:32:44.3307192Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3307534Z 2025-05-07T20:32:44.3307729Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3308053Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3308383Z x = x_sign * x_clamp 2025-05-07T20:32:44.3308620Z x0 = x[:, :D] 2025-05-07T20:32:44.3308834Z x1 = x[:, D:] 2025-05-07T20:32:44.3309041Z 2025-05-07T20:32:44.3309225Z if contiguous: 2025-05-07T20:32:44.3309457Z x0 = x0.contiguous() 2025-05-07T20:32:44.3309711Z x1 = x1.contiguous() 2025-05-07T20:32:44.3310021Z 2025-05-07T20:32:44.3310387Z if scale_ub is not None: 2025-05-07T20:32:44.3310661Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3310991Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3311299Z ) 2025-05-07T20:32:44.3311495Z else: 2025-05-07T20:32:44.3311702Z scale_ub_tensor = None 2025-05-07T20:32:44.3311953Z 2025-05-07T20:32:44.3312184Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3312492Z op = silu_mul_quant 2025-05-07T20:32:44.3312739Z if compiled: 2025-05-07T20:32:44.3312986Z op = torch.compile(op) 2025-05-07T20:32:44.3313282Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3313555Z 2025-05-07T20:32:44.3313750Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3313912Z 2025-05-07T20:32:44.3314013Z moe/activation_test.py:117: 2025-05-07T20:32:44.3314305Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3314638Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3314922Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3315605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3316293Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3316824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3317578Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3318285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3318814Z kernel = self.compile( 2025-05-07T20:32:44.3319354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3320004Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3320402Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3320702Z 2025-05-07T20:32:44.3320903Z self = 2025-05-07T20:32:44.3321977Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3323399Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e233b00>} 2025-05-07T20:32:44.3324734Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3325753Z context = 2025-05-07T20:32:44.3326064Z 2025-05-07T20:32:44.3326233Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3326751Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3327216Z module_map=module_map) 2025-05-07T20:32:44.3327640Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3327996Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3328262Z E ^ 2025-05-07T20:32:44.3328770Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3329227Z 2025-05-07T20:32:44.3329642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3330153Z 2025-05-07T20:32:44.3330258Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3330715Z self=, 2025-05-07T20:32:44.3331113Z T=2048, 2025-05-07T20:32:44.3331298Z D=5120, 2025-05-07T20:32:44.3331488Z scale_ub=None, 2025-05-07T20:32:44.3331698Z contiguous=True, 2025-05-07T20:32:44.3331925Z compiled=False, 2025-05-07T20:32:44.3332131Z ) 2025-05-07T20:32:44.3332446Z self = 2025-05-07T20:32:44.3332936Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3333202Z 2025-05-07T20:32:44.3333288Z @given( 2025-05-07T20:32:44.3333518Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3333826Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3334127Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3334455Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3334777Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3335066Z ) 2025-05-07T20:32:44.3335411Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3335841Z def test_silu_mul_quant( 2025-05-07T20:32:44.3336082Z self, 2025-05-07T20:32:44.3336275Z T: int, 2025-05-07T20:32:44.3336466Z D: int, 2025-05-07T20:32:44.3336683Z scale_ub: Optional[float], 2025-05-07T20:32:44.3342593Z contiguous: bool, 2025-05-07T20:32:44.3342852Z compiled: bool, 2025-05-07T20:32:44.3343150Z ) -> None: 2025-05-07T20:32:44.3343376Z torch.manual_seed(2025) 2025-05-07T20:32:44.3343626Z 2025-05-07T20:32:44.3343898Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3344248Z 2025-05-07T20:32:44.3344446Z > x_sign = torch.sign(x) 2025-05-07T20:32:44.3346396Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
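Note the trend across examples: free memory falls from 140.44 MiB to 28.44 MiB to 26.44 MiB while PyTorch's own allocation climbs toward 21.7 GiB, so later, even smaller, examples fail at the very first torch.randn. Because Hypothesis draws every example inside a single test invocation, a per-method tearDown never runs between examples; any cleanup would have to happen at the end of the test body itself. A hedged sketch of such a helper (its use here is an assumption, not something the suite currently does):

import gc

import torch

def release_cuda_memory() -> None:
    # Drop dead Python references first, then return the allocator's cached
    # blocks to the driver so the next Hypothesis example starts clean.
    gc.collect()
    torch.cuda.empty_cache()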
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3348349Z 2025-05-07T20:32:44.3348474Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:44.3348735Z 2025-05-07T20:32:44.3348844Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3349262Z self=, 2025-05-07T20:32:44.3349664Z T=16384, 2025-05-07T20:32:44.3349858Z D=5120, 2025-05-07T20:32:44.3350058Z scale_ub=None, 2025-05-07T20:32:44.3350275Z contiguous=True, 2025-05-07T20:32:44.3350502Z compiled=False, 2025-05-07T20:32:44.3350714Z ) 2025-05-07T20:32:44.4045700Z self = 2025-05-07T20:32:44.4046529Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.4046947Z 2025-05-07T20:32:44.4047062Z @given( 2025-05-07T20:32:44.4047395Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4047860Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4048184Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4048533Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4048880Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4049182Z ) 2025-05-07T20:32:44.4049535Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4049988Z def test_silu_mul_quant( 2025-05-07T20:32:44.4050246Z self, 2025-05-07T20:32:44.4050450Z T: int, 2025-05-07T20:32:44.4050664Z D: int, 2025-05-07T20:32:44.4051183Z scale_ub: Optional[float], 2025-05-07T20:32:44.4051459Z contiguous: bool, 2025-05-07T20:32:44.4051715Z compiled: bool, 2025-05-07T20:32:44.4051957Z ) -> None: 2025-05-07T20:32:44.4052179Z torch.manual_seed(2025) 2025-05-07T20:32:44.4052441Z 2025-05-07T20:32:44.4052734Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4054820Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.4056693Z 2025-05-07T20:32:44.4056825Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.4057044Z 2025-05-07T20:32:44.4057153Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4057575Z self=, 2025-05-07T20:32:44.4057988Z T=4096, 2025-05-07T20:32:44.4058182Z D=5120, 2025-05-07T20:32:44.4058397Z scale_ub=None, 2025-05-07T20:32:44.4058623Z contiguous=True, 2025-05-07T20:32:44.4058859Z compiled=False, 2025-05-07T20:32:44.4059686Z ) 2025-05-07T20:32:44.4060025Z self = 2025-05-07T20:32:44.4060532Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.4060802Z 2025-05-07T20:32:44.4060886Z @given( 2025-05-07T20:32:44.4061132Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4061460Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4061766Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4062114Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4062541Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4062831Z ) 2025-05-07T20:32:44.4063193Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4063654Z def test_silu_mul_quant( 2025-05-07T20:32:44.4063913Z self, 2025-05-07T20:32:44.4064115Z T: int, 2025-05-07T20:32:44.4064327Z D: int, 2025-05-07T20:32:44.4064644Z scale_ub: Optional[float], 2025-05-07T20:32:44.4064926Z contiguous: bool, 2025-05-07T20:32:44.4065182Z compiled: bool, 2025-05-07T20:32:44.4065419Z ) -> None: 2025-05-07T20:32:44.4065645Z torch.manual_seed(2025) 2025-05-07T20:32:44.4065905Z 2025-05-07T20:32:44.4066195Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4068248Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.4070113Z 2025-05-07T20:32:44.4070247Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.4070467Z 2025-05-07T20:32:44.4070577Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4071000Z self=, 2025-05-07T20:32:44.4071422Z T=2048, 2025-05-07T20:32:44.4071616Z D=5120, 2025-05-07T20:32:44.4071828Z scale_ub=None, 2025-05-07T20:32:44.4072067Z contiguous=False, 2025-05-07T20:32:44.4072348Z compiled=False, 2025-05-07T20:32:44.4072573Z ) 2025-05-07T20:32:44.4072911Z self = 2025-05-07T20:32:44.4073411Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.4073696Z 2025-05-07T20:32:44.4073783Z @given( 2025-05-07T20:32:44.4074029Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4074355Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4074672Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4075016Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4075360Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4075654Z ) 2025-05-07T20:32:44.4076017Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4076470Z def test_silu_mul_quant( 2025-05-07T20:32:44.4076718Z self, 2025-05-07T20:32:44.4076933Z T: int, 2025-05-07T20:32:44.4077144Z D: int, 2025-05-07T20:32:44.4077372Z scale_ub: Optional[float], 2025-05-07T20:32:44.4077654Z contiguous: bool, 2025-05-07T20:32:44.4077926Z compiled: bool, 2025-05-07T20:32:44.4078181Z ) -> None: 2025-05-07T20:32:44.4078412Z torch.manual_seed(2025) 2025-05-07T20:32:44.4078667Z 2025-05-07T20:32:44.4078948Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4081041Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.4082938Z 2025-05-07T20:32:44.4083057Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.4083279Z 2025-05-07T20:32:44.4083382Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4083796Z self=, 2025-05-07T20:32:44.4084200Z T=4096, 2025-05-07T20:32:44.4084395Z D=7168, 2025-05-07T20:32:44.4084592Z scale_ub=None, 2025-05-07T20:32:44.4084812Z contiguous=True, 2025-05-07T20:32:44.4085071Z compiled=True, 2025-05-07T20:32:44.4085280Z ) 2025-05-07T20:32:44.4085604Z self = 2025-05-07T20:32:44.4086089Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.4086369Z 2025-05-07T20:32:44.4086451Z @given( 2025-05-07T20:32:44.4086706Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4087028Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4087334Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4087741Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4088091Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4088383Z ) 2025-05-07T20:32:44.4088740Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4089188Z def test_silu_mul_quant( 2025-05-07T20:32:44.4089432Z self, 2025-05-07T20:32:44.4089633Z T: int, 2025-05-07T20:32:44.4089846Z D: int, 2025-05-07T20:32:44.4090078Z scale_ub: Optional[float], 2025-05-07T20:32:44.4090351Z contiguous: bool, 2025-05-07T20:32:44.4090600Z compiled: bool, 2025-05-07T20:32:44.4090834Z ) -> None: 2025-05-07T20:32:44.4091053Z torch.manual_seed(2025) 2025-05-07T20:32:44.4091305Z 2025-05-07T20:32:44.4091587Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4093689Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.4095545Z 2025-05-07T20:32:44.4095678Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.4095892Z 2025-05-07T20:32:44.4095998Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4096422Z self=, 2025-05-07T20:32:44.4096835Z T=2048, 2025-05-07T20:32:44.4097028Z D=5120, 2025-05-07T20:32:44.4097232Z scale_ub=1200.0, 2025-05-07T20:32:44.4097466Z contiguous=False, 2025-05-07T20:32:44.4097695Z compiled=False, 2025-05-07T20:32:44.4097911Z ) 2025-05-07T20:32:44.4098240Z self = 2025-05-07T20:32:44.4098739Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.4099015Z 2025-05-07T20:32:44.4099097Z @given( 2025-05-07T20:32:44.4099343Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4099713Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4100023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4100366Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4100701Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4100990Z ) 2025-05-07T20:32:44.4101347Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4101795Z def test_silu_mul_quant( 2025-05-07T20:32:44.4102052Z self, 2025-05-07T20:32:44.4102292Z T: int, 2025-05-07T20:32:44.4102501Z D: int, 2025-05-07T20:32:44.4102731Z scale_ub: Optional[float], 2025-05-07T20:32:44.4103005Z contiguous: bool, 2025-05-07T20:32:44.4103255Z compiled: bool, 2025-05-07T20:32:44.4103490Z ) -> None: 2025-05-07T20:32:44.4103710Z torch.manual_seed(2025) 2025-05-07T20:32:44.4103964Z 2025-05-07T20:32:44.4104246Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4106659Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.4108570Z 2025-05-07T20:32:44.4108697Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.4108922Z 2025-05-07T20:32:44.4109031Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4109454Z self=, 2025-05-07T20:32:44.4109859Z T=4096, 2025-05-07T20:32:44.4110050Z D=7168, 2025-05-07T20:32:44.4110255Z scale_ub=1200.0, 2025-05-07T20:32:44.4110492Z contiguous=True, 2025-05-07T20:32:44.4110717Z compiled=False, 2025-05-07T20:32:44.4110929Z ) 2025-05-07T20:32:44.5025471Z self = 2025-05-07T20:32:44.5026245Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.5026640Z 2025-05-07T20:32:44.5026752Z @given( 2025-05-07T20:32:44.5027075Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.5027630Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.5027978Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.5028360Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.5028734Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.5029059Z ) 2025-05-07T20:32:44.5029468Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.5030001Z def test_silu_mul_quant( 2025-05-07T20:32:44.5030271Z self, 2025-05-07T20:32:44.5030479Z T: int, 2025-05-07T20:32:44.5030685Z D: int, 2025-05-07T20:32:44.5030918Z scale_ub: Optional[float], 2025-05-07T20:32:44.5031224Z contiguous: bool, 2025-05-07T20:32:44.5031486Z compiled: bool, 2025-05-07T20:32:44.5031742Z ) -> None: 2025-05-07T20:32:44.5031970Z torch.manual_seed(2025) 2025-05-07T20:32:44.5032217Z 2025-05-07T20:32:44.5032491Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.5034658Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.5036557Z 2025-05-07T20:32:44.5036677Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.5036892Z 2025-05-07T20:32:44.5037008Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.5037426Z self=, 2025-05-07T20:32:44.5037847Z T=16384, 2025-05-07T20:32:44.5038055Z D=7168, 2025-05-07T20:32:44.5038340Z scale_ub=None, 2025-05-07T20:32:44.5038556Z contiguous=False, 2025-05-07T20:32:44.5038798Z compiled=True, 2025-05-07T20:32:44.5039018Z ) 2025-05-07T20:32:44.5039339Z self = 2025-05-07T20:32:44.5039851Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.5040133Z 2025-05-07T20:32:44.5040221Z @given( 2025-05-07T20:32:44.5040533Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.5040862Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.5041180Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.5041515Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.5041857Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.5042156Z ) 2025-05-07T20:32:44.5042517Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.5042963Z def test_silu_mul_quant( 2025-05-07T20:32:44.5043222Z self, 2025-05-07T20:32:44.5043429Z T: int, 2025-05-07T20:32:44.5043632Z D: int, 2025-05-07T20:32:44.5043860Z scale_ub: Optional[float], 2025-05-07T20:32:44.5044143Z contiguous: bool, 2025-05-07T20:32:44.5044390Z compiled: bool, 2025-05-07T20:32:44.5044621Z ) -> None: 2025-05-07T20:32:44.5044848Z torch.manual_seed(2025) 2025-05-07T20:32:44.5045093Z 2025-05-07T20:32:44.5045377Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.5047430Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.5049491Z 2025-05-07T20:32:44.5049617Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.5049834Z 2025-05-07T20:32:44.5049950Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.5050370Z self=, 2025-05-07T20:32:44.5050786Z T=4096, 2025-05-07T20:32:44.5050989Z D=7168, 2025-05-07T20:32:44.5051181Z scale_ub=None, 2025-05-07T20:32:44.5051405Z contiguous=True, 2025-05-07T20:32:44.5051637Z compiled=False, 2025-05-07T20:32:44.5051847Z ) 2025-05-07T20:32:44.5052169Z self = 2025-05-07T20:32:44.5052673Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.5052946Z 2025-05-07T20:32:44.5053045Z @given( 2025-05-07T20:32:44.5053280Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.5053607Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.5053926Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.5054258Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.5054601Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.5054895Z ) 2025-05-07T20:32:44.5055295Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.5055750Z def test_silu_mul_quant( 2025-05-07T20:32:44.5056009Z self, 2025-05-07T20:32:44.5056209Z T: int, 2025-05-07T20:32:44.5056413Z D: int, 2025-05-07T20:32:44.5056641Z scale_ub: Optional[float], 2025-05-07T20:32:44.5056917Z contiguous: bool, 2025-05-07T20:32:44.5057174Z compiled: bool, 2025-05-07T20:32:44.5057409Z ) -> None: 2025-05-07T20:32:44.5057634Z torch.manual_seed(2025) 2025-05-07T20:32:44.5057889Z 2025-05-07T20:32:44.5058225Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.5060325Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.5062181Z 2025-05-07T20:32:44.5062314Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.5062527Z 2025-05-07T20:32:44.5062633Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.5063059Z self=, 2025-05-07T20:32:44.5063484Z T=16384, 2025-05-07T20:32:44.5063694Z D=7168, 2025-05-07T20:32:44.5063893Z scale_ub=None, 2025-05-07T20:32:44.5064130Z contiguous=True, 2025-05-07T20:32:44.5064358Z compiled=False, 2025-05-07T20:32:44.5064574Z ) 2025-05-07T20:32:44.5064899Z self = 2025-05-07T20:32:44.5065396Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.5065686Z 2025-05-07T20:32:44.5065774Z @given( 2025-05-07T20:32:44.5066018Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.5066360Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.5066670Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.5067015Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.5067354Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.5067642Z ) 2025-05-07T20:32:44.5068050Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.5068499Z def test_silu_mul_quant( 2025-05-07T20:32:44.5068744Z self, 2025-05-07T20:32:44.5068954Z T: int, 2025-05-07T20:32:44.5069165Z D: int, 2025-05-07T20:32:44.5069386Z scale_ub: Optional[float], 2025-05-07T20:32:44.5069671Z contiguous: bool, 2025-05-07T20:32:44.5069925Z compiled: bool, 2025-05-07T20:32:44.5070150Z ) -> None: 2025-05-07T20:32:44.5070386Z torch.manual_seed(2025) 2025-05-07T20:32:44.5070645Z 2025-05-07T20:32:44.5070930Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.5072973Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.5074841Z 2025-05-07T20:32:44.5074968Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.5075193Z 2025-05-07T20:32:44.5075305Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.5075781Z self=, 2025-05-07T20:32:44.5076194Z T=16384, 2025-05-07T20:32:44.5076403Z D=7168, 2025-05-07T20:32:44.5076610Z scale_ub=1200.0, 2025-05-07T20:32:44.5076851Z contiguous=True, 2025-05-07T20:32:44.5077079Z compiled=False, 2025-05-07T20:32:44.5077297Z ) 2025-05-07T20:32:44.5077628Z self = 2025-05-07T20:32:44.5078126Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.5078462Z 2025-05-07T20:32:44.5078544Z @given( 2025-05-07T20:32:44.5078786Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.5079099Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.5079412Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.5079751Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.5080079Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.5080376Z ) 2025-05-07T20:32:44.5080778Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.5081225Z def test_silu_mul_quant( 2025-05-07T20:32:44.5081470Z self, 2025-05-07T20:32:44.5081680Z T: int, 2025-05-07T20:32:44.5081893Z D: int, 2025-05-07T20:32:44.5082115Z scale_ub: Optional[float], 2025-05-07T20:32:44.5082399Z contiguous: bool, 2025-05-07T20:32:44.5082650Z compiled: bool, 2025-05-07T20:32:44.5082881Z ) -> None: 2025-05-07T20:32:44.5083117Z torch.manual_seed(2025) 2025-05-07T20:32:44.5083373Z 2025-05-07T20:32:44.5083647Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.5085707Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
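All four examples above fail the same way: PyTorch already holds ~21.73 GiB before the example runs, leaving only 26.44 MiB free. The error text's own suggestion (expandable_segments) targets fragmentation, which looks secondary here (only ~19 MiB is reserved-but-unallocated); releasing memory between Hypothesis examples matters more. A minimal sketch of both mitigations, assuming nothing about the suite's actual fixtures (release_cuda_memory is an illustrative name):

    import os

    # Allocator setting suggested by the error message itself; it must be in
    # the environment before the first CUDA allocation in the process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import gc
    import torch

    def release_cuda_memory() -> None:
        # Drop dead Python references, then return the caching allocator's
        # blocks to the driver so the next example starts from a clean slate.
        gc.collect()
        torch.cuda.empty_cache()

Calling release_cuda_memory() from the test's tearDown (or at the top of the test body) would keep tensors from one generated example from starving the next.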
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.5087645Z 2025-05-07T20:32:44.5087774Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.5088020Z 2025-05-07T20:32:44.5088137Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.5088621Z self=, 2025-05-07T20:32:44.5089026Z T=128, 2025-05-07T20:32:44.5089229Z D=5120, 2025-05-07T20:32:44.5089436Z scale_ub=1200.0, 2025-05-07T20:32:44.5089666Z contiguous=False, 2025-05-07T20:32:44.5089907Z compiled=False, 2025-05-07T20:32:44.5090124Z ) 2025-05-07T20:32:44.6109880Z self = 2025-05-07T20:32:44.6110656Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.6111053Z 2025-05-07T20:32:44.6111166Z @given( 2025-05-07T20:32:44.6111495Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.6111922Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.6112245Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.6112589Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.6112921Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.6113236Z ) 2025-05-07T20:32:44.6113595Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.6114040Z def test_silu_mul_quant( 2025-05-07T20:32:44.6114295Z self, 2025-05-07T20:32:44.6114506Z T: int, 2025-05-07T20:32:44.6114736Z D: int, 2025-05-07T20:32:44.6114957Z scale_ub: Optional[float], 2025-05-07T20:32:44.6115242Z contiguous: bool, 2025-05-07T20:32:44.6115754Z compiled: bool, 2025-05-07T20:32:44.6116002Z ) -> None: 2025-05-07T20:32:44.6116226Z torch.manual_seed(2025) 2025-05-07T20:32:44.6116479Z 2025-05-07T20:32:44.6116764Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.6117113Z 2025-05-07T20:32:44.6117319Z x_sign = torch.sign(x) 2025-05-07T20:32:44.6125678Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.6126031Z x = x_sign * x_clamp 2025-05-07T20:32:44.6126305Z x0 = x[:, :D] 2025-05-07T20:32:44.6126722Z x1 = x[:, D:] 2025-05-07T20:32:44.6126939Z 2025-05-07T20:32:44.6127148Z if contiguous: 2025-05-07T20:32:44.6127400Z x0 = x0.contiguous() 2025-05-07T20:32:44.6127786Z x1 = x1.contiguous() 2025-05-07T20:32:44.6128055Z 2025-05-07T20:32:44.6128291Z if scale_ub is not None: 2025-05-07T20:32:44.6128580Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.6129004Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.6129333Z ) 2025-05-07T20:32:44.6129553Z else: 2025-05-07T20:32:44.6129774Z scale_ub_tensor = None 2025-05-07T20:32:44.6130044Z 2025-05-07T20:32:44.6130297Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.6130624Z op = silu_mul_quant 2025-05-07T20:32:44.6130893Z if compiled: 2025-05-07T20:32:44.6131165Z op = torch.compile(op) 2025-05-07T20:32:44.6131474Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.6131771Z 2025-05-07T20:32:44.6131990Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.6132166Z 2025-05-07T20:32:44.6132277Z moe/activation_test.py:117: 2025-05-07T20:32:44.6132604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.6132965Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.6133268Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.6133978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.6134692Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.6135246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:32:44.6135939Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.6136709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.6137253Z kernel = self.compile( 2025-05-07T20:32:44.6137813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.6138485Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.6138889Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.6139135Z 2025-05-07T20:32:44.6139348Z self = 2025-05-07T20:32:44.6140437Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.6141839Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5999d5a700>} 2025-05-07T20:32:44.6143205Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.6144235Z context = 2025-05-07T20:32:44.6144535Z 2025-05-07T20:32:44.6144754Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.6145290Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.6145772Z module_map=module_map) 2025-05-07T20:32:44.6146140Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.6146507Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.6146779Z E ^ 2025-05-07T20:32:44.6147242Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.6147748Z 2025-05-07T20:32:44.6148167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.6148689Z 2025-05-07T20:32:44.6148797Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.6149220Z self=, 2025-05-07T20:32:44.6149624Z T=2048, 2025-05-07T20:32:44.6149831Z D=7168, 2025-05-07T20:32:44.6150076Z scale_ub=None, 2025-05-07T20:32:44.6150292Z contiguous=False, 2025-05-07T20:32:44.6150528Z compiled=False, 2025-05-07T20:32:44.6150745Z ) 2025-05-07T20:32:44.6151063Z self = 2025-05-07T20:32:44.6151563Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.6151835Z 2025-05-07T20:32:44.6151925Z @given( 2025-05-07T20:32:44.6152163Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.6152489Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.6152800Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.6153136Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.6153464Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.6153755Z ) 2025-05-07T20:32:44.6154107Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.6154550Z def test_silu_mul_quant( 2025-05-07T20:32:44.6154799Z self, 2025-05-07T20:32:44.6155005Z T: int, 2025-05-07T20:32:44.6155203Z D: int, 2025-05-07T20:32:44.6155423Z scale_ub: Optional[float], 2025-05-07T20:32:44.6155694Z contiguous: bool, 2025-05-07T20:32:44.6155929Z compiled: bool, 2025-05-07T20:32:44.6156155Z ) -> None: 2025-05-07T20:32:44.6156372Z torch.manual_seed(2025) 2025-05-07T20:32:44.6156657Z 2025-05-07T20:32:44.6156935Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.6158995Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
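The CompilationError above is a distinct failure mode from the OOMs: Triton rejects the fp8e4nv element type (FP8 E4M3 in NVIDIA's encoding) because it requires a compute capability 8.9+ GPU, and on this runner's older device Triton offers only 'fp8e4b15' and 'fp8e5'. Tests exercising FP8 kernels are typically gated on device capability; a hedged sketch of such a guard (fp8_e4m3_supported is an illustrative helper, not an existing fixture):

    import unittest
    import torch

    def fp8_e4m3_supported() -> bool:
        # Triton's fp8e4nv maps to FP8 E4M3, which needs an sm_89+ (Ada/Hopper) GPU.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Applied to the failing test, e.g.:
    # @unittest.skipUnless(fp8_e4m3_supported(), "FP8 E4M3 unsupported on this GPU")

With such a guard the suite would report a skip on this hardware instead of a CompilationError deep inside the Triton autotuner.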
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.6160851Z 2025-05-07T20:32:44.6160970Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.6161185Z 2025-05-07T20:32:44.6161303Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.6161712Z self=, 2025-05-07T20:32:44.6162128Z T=128, 2025-05-07T20:32:44.6162321Z D=7168, 2025-05-07T20:32:44.6162516Z scale_ub=1200.0, 2025-05-07T20:32:44.6162743Z contiguous=True, 2025-05-07T20:32:44.6162976Z compiled=True, 2025-05-07T20:32:44.6163184Z ) 2025-05-07T20:32:44.6455339Z self = 2025-05-07T20:32:44.6456028Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.6456371Z 2025-05-07T20:32:44.6456679Z @given( 2025-05-07T20:32:44.6456926Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.6457248Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.6457551Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.6457884Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.6458269Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.6458558Z ) 2025-05-07T20:32:44.6458916Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.6459451Z def test_silu_mul_quant( 2025-05-07T20:32:44.6459697Z self, 2025-05-07T20:32:44.6459908Z T: int, 2025-05-07T20:32:44.6460119Z D: int, 2025-05-07T20:32:44.6460343Z scale_ub: Optional[float], 2025-05-07T20:32:44.6460627Z contiguous: bool, 2025-05-07T20:32:44.6460880Z compiled: bool, 2025-05-07T20:32:44.6461117Z ) -> None: 2025-05-07T20:32:44.6461335Z torch.manual_seed(2025) 2025-05-07T20:32:44.6461588Z 2025-05-07T20:32:44.6461946Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.6462297Z 2025-05-07T20:32:44.6462505Z x_sign = torch.sign(x) 2025-05-07T20:32:44.6462809Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.6463118Z x = x_sign * x_clamp 2025-05-07T20:32:44.6463369Z x0 = x[:, :D] 2025-05-07T20:32:44.6463595Z x1 = x[:, D:] 2025-05-07T20:32:44.6463812Z 2025-05-07T20:32:44.6464008Z if contiguous: 2025-05-07T20:32:44.6464252Z x0 = x0.contiguous() 2025-05-07T20:32:44.6464514Z x1 = x1.contiguous() 2025-05-07T20:32:44.6464766Z 2025-05-07T20:32:44.6464968Z if scale_ub is not None: 2025-05-07T20:32:44.6465241Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.6465585Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.6465908Z ) 2025-05-07T20:32:44.6466115Z else: 2025-05-07T20:32:44.6466337Z scale_ub_tensor = None 2025-05-07T20:32:44.6466604Z 2025-05-07T20:32:44.6466846Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.6467173Z op = silu_mul_quant 2025-05-07T20:32:44.6467428Z if compiled: 2025-05-07T20:32:44.6467684Z op = torch.compile(op) 2025-05-07T20:32:44.6467987Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.6468301Z 2025-05-07T20:32:44.6468598Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.6468769Z 2025-05-07T20:32:44.6468878Z moe/activation_test.py:117: 2025-05-07T20:32:44.6469173Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.6469513Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.6469800Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.6470357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.6470929Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.6471601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.6472293Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.6472832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.6473515Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.6474185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.6474723Z kernel = self.compile( 2025-05-07T20:32:44.6475265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.6475924Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.6476374Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.6476606Z 2025-05-07T20:32:44.6476812Z self = 2025-05-07T20:32:44.6477893Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.6479285Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5999d5bf60>} 2025-05-07T20:32:44.6480676Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.6481705Z context = 2025-05-07T20:32:44.6481995Z 2025-05-07T20:32:44.6482208Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.6482735Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.6483206Z module_map=module_map) 2025-05-07T20:32:44.6483576Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.6483930Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.6484201Z E ^ 2025-05-07T20:32:44.6484669Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.6485122Z 2025-05-07T20:32:44.6485541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.6486059Z 2025-05-07T20:32:44.6486166Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.6486586Z self=, 2025-05-07T20:32:44.6486997Z T=128, 2025-05-07T20:32:44.6487187Z D=7168, 2025-05-07T20:32:44.6487390Z scale_ub=1200.0, 2025-05-07T20:32:44.6487770Z contiguous=True, 2025-05-07T20:32:44.6487990Z compiled=False, 2025-05-07T20:32:44.6488201Z ) 2025-05-07T20:32:44.6488522Z self = 2025-05-07T20:32:44.6489004Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.6489328Z 2025-05-07T20:32:44.6489405Z @given( 2025-05-07T20:32:44.6489642Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.6489948Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.6490257Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.6490585Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.6490913Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.6491193Z ) 2025-05-07T20:32:44.6491549Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.6491987Z def test_silu_mul_quant( 2025-05-07T20:32:44.6492226Z self, 2025-05-07T20:32:44.6492429Z T: int, 2025-05-07T20:32:44.6492630Z D: int, 2025-05-07T20:32:44.6492842Z scale_ub: Optional[float], 2025-05-07T20:32:44.6493117Z contiguous: bool, 2025-05-07T20:32:44.6493354Z compiled: bool, 2025-05-07T20:32:44.6493570Z ) -> None: 2025-05-07T20:32:44.6493794Z torch.manual_seed(2025) 2025-05-07T20:32:44.6494046Z 2025-05-07T20:32:44.6494321Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.6494671Z 2025-05-07T20:32:44.6494873Z x_sign = torch.sign(x) 2025-05-07T20:32:44.6495172Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.6497257Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.6499161Z 2025-05-07T20:32:44.6499285Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.6499553Z 2025-05-07T20:32:44.6499662Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.6500088Z self=, 2025-05-07T20:32:44.6500490Z T=128, 2025-05-07T20:32:44.6500687Z D=5120, 2025-05-07T20:32:44.6500887Z scale_ub=1200.0, 2025-05-07T20:32:44.6501115Z contiguous=True, 2025-05-07T20:32:44.6501336Z compiled=True, 2025-05-07T20:32:44.6501540Z ) 2025-05-07T20:32:44.6501908Z self = 2025-05-07T20:32:44.6502398Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.6502670Z 2025-05-07T20:32:44.6502752Z @given( 2025-05-07T20:32:44.6502989Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.6503303Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.6503616Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.6503950Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.6504279Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.6504566Z ) 2025-05-07T20:32:44.6504920Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.6505369Z def test_silu_mul_quant( 2025-05-07T20:32:44.6505949Z self, 2025-05-07T20:32:44.6506163Z T: int, 2025-05-07T20:32:44.6506369Z D: int, 2025-05-07T20:32:44.6506590Z scale_ub: Optional[float], 2025-05-07T20:32:44.6506877Z contiguous: bool, 2025-05-07T20:32:44.6507127Z compiled: bool, 2025-05-07T20:32:44.6507349Z ) -> None: 2025-05-07T20:32:44.6507572Z torch.manual_seed(2025) 2025-05-07T20:32:44.6507827Z 2025-05-07T20:32:44.6508102Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.6508448Z 2025-05-07T20:32:44.6508652Z x_sign = torch.sign(x) 2025-05-07T20:32:44.6509053Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.6511051Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.6512906Z 2025-05-07T20:32:44.6513030Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.6513247Z 2025-05-07T20:32:44.6513353Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.6513768Z self=, 2025-05-07T20:32:44.6514170Z T=128, 2025-05-07T20:32:44.6514370Z D=7168, 2025-05-07T20:32:44.6514567Z scale_ub=None, 2025-05-07T20:32:44.6514783Z contiguous=True, 2025-05-07T20:32:44.6515007Z compiled=True, 2025-05-07T20:32:44.6515215Z ) 2025-05-07T20:32:44.8441632Z self = 2025-05-07T20:32:44.8442145Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.8442450Z 2025-05-07T20:32:44.8442559Z @given( 2025-05-07T20:32:44.8443171Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8443507Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8443818Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8444159Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8444499Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8444791Z ) 2025-05-07T20:32:44.8445150Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8445611Z def test_silu_mul_quant( 2025-05-07T20:32:44.8445940Z self, 2025-05-07T20:32:44.8446145Z T: int, 2025-05-07T20:32:44.8446354Z D: int, 2025-05-07T20:32:44.8446577Z scale_ub: Optional[float], 2025-05-07T20:32:44.8446862Z contiguous: bool, 2025-05-07T20:32:44.8447118Z compiled: bool, 2025-05-07T20:32:44.8447349Z ) -> None: 2025-05-07T20:32:44.8447667Z torch.manual_seed(2025) 2025-05-07T20:32:44.8447914Z 2025-05-07T20:32:44.8448268Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8450332Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
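By this point in the run even 20 MiB requests fail with only 4.44 MiB free and 21.77 GiB allocated, which again points at memory accumulating across examples rather than any single oversized input. When diagnosing this kind of state, PyTorch's allocator introspection is useful (a debugging sketch, not part of the suite):

    import torch

    # Live tensor bytes vs. bytes the caching allocator is holding.
    print(f"{torch.cuda.memory_allocated() / 2**30:.2f} GiB allocated")
    print(f"{torch.cuda.memory_reserved() / 2**30:.2f} GiB reserved")
    # Full per-pool breakdown, including fragmentation statistics.
    print(torch.cuda.memory_summary())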
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.8452215Z 2025-05-07T20:32:44.8452338Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.8452559Z 2025-05-07T20:32:44.8461465Z FAILED 2025-05-07T20:32:44.8461607Z 2025-05-07T20:32:44.8461777Z =================================== FAILURES =================================== 2025-05-07T20:32:44.8462371Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:44.8463000Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:44.8463842Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 57, in testPartExecutor 2025-05-07T20:32:44.8464598Z | yield 2025-05-07T20:32:44.8465181Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 623, in run 2025-05-07T20:32:44.8466034Z | self._callTestMethod(testMethod) 2025-05-07T20:32:44.8466804Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 579, in _callTestMethod 2025-05-07T20:32:44.8467558Z | if method() is not None: 2025-05-07T20:32:44.8467889Z | ^^^^^^^^ 2025-05-07T20:32:44.8468833Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:44.8469849Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8470262Z | ^^^^^^^ 2025-05-07T20:32:44.8471021Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:44.8471874Z | raise the_error_hypothesis_found 2025-05-07T20:32:44.8472327Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:44.8472750Z +-+---------------- 1 ---------------- 2025-05-07T20:32:44.8473064Z | Traceback (most recent call last): 2025-05-07T20:32:44.8474035Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:44.8475136Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8475990Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8478832Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.8481614Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:44.8482211Z | self=, 2025-05-07T20:32:44.8482782Z | T=2048, 2025-05-07T20:32:44.8483113Z | D=5120, # or any other generated value 2025-05-07T20:32:44.8483572Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:44.8484081Z | contiguous=True, # or any other generated value 2025-05-07T20:32:44.8484655Z | compiled=False, # or any other generated value 2025-05-07T20:32:44.8485084Z | ) 2025-05-07T20:32:44.8485331Z | 2025-05-07T20:32:44.8486051Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:44.8486899Z +---------------- 2 ---------------- 2025-05-07T20:32:44.8487296Z | Traceback (most recent call last): 2025-05-07T20:32:44.8488468Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:44.8489572Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8490088Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8492830Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.8495609Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:44.8496245Z | self=, 2025-05-07T20:32:44.8496808Z | T=128, 2025-05-07T20:32:44.8497095Z | D=7168, 2025-05-07T20:32:44.8497385Z | scale_ub=None, 2025-05-07T20:32:44.8497743Z | contiguous=True, 2025-05-07T20:32:44.8498079Z | compiled=True, 2025-05-07T20:32:44.8498388Z | ) 2025-05-07T20:32:44.8498643Z | 2025-05-07T20:32:44.8499384Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:44.8500215Z +---------------- 3 ---------------- 2025-05-07T20:32:44.8500621Z | Traceback (most recent call last): 2025-05-07T20:32:44.8501644Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:44.8502726Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8503230Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8505327Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
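Each distinct failure above ships with a @reproduce_failure payload; applying one temporarily pins Hypothesis to that exact example, as the report itself instructs. A sketch for the first failure (the version string and blob must be copied verbatim from the output above):

    from hypothesis import reproduce_failure

    # @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
    # def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
    #     ...

The decorator should be removed again once the example passes, since it disables normal example generation.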
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.8507587Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:44.8527076Z | self=, 2025-05-07T20:32:44.8527844Z | T=128, 2025-05-07T20:32:44.8528207Z | D=5120, 2025-05-07T20:32:44.8528716Z | scale_ub=1200.0, 2025-05-07T20:32:44.8529073Z | contiguous=True, 2025-05-07T20:32:44.8529414Z | compiled=True, 2025-05-07T20:32:44.8529740Z | ) 2025-05-07T20:32:44.8530004Z | 2025-05-07T20:32:44.8530750Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:44.8531619Z +---------------- 4 ---------------- 2025-05-07T20:32:44.8532137Z | Traceback (most recent call last): 2025-05-07T20:32:44.8533150Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:44.8534146Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:44.8534551Z | ^^^^^^^^ 2025-05-07T20:32:44.8535454Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:44.8536446Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.8536938Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8538078Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:44.8539200Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.8540049Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 330, in <lambda> 2025-05-07T20:32:44.8541074Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8541694Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8542582Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:44.8543813Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.8544481Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8545417Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in <dictcomp> 2025-05-07T20:32:44.8546548Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.8547172Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8548097Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:44.8549111Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.8549647Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8550493Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:44.8551285Z | fn() 2025-05-07T20:32:44.8552085Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:44.8552976Z | self.fn.run( 2025-05-07T20:32:44.8553813Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:44.8554633Z | kernel = self.compile( 2025-05-07T20:32:44.8555006Z | ^^^^^^^^^^^^^ 2025-05-07T20:32:44.8555810Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:44.8556762Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8557305Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8558249Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:44.8559354Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8560033Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8560660Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8561155Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.8561510Z | ^ 2025-05-07T20:32:44.8562160Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8562952Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:44.8563469Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:44.8564148Z | self=, 2025-05-07T20:32:44.8564719Z | T=1, # or any other generated value 2025-05-07T20:32:44.8565130Z | D=5120, # or any other generated value 2025-05-07T20:32:44.8565571Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:44.8566049Z | contiguous=True, # or any other generated value 2025-05-07T20:32:44.8566541Z | compiled=True, # or any other generated value 2025-05-07T20:32:44.8566942Z | ) 2025-05-07T20:32:44.8567192Z | 2025-05-07T20:32:44.8568019Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:44.8568824Z +------------------------------------ 2025-05-07T20:32:44.8569308Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:44.8569872Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8570416Z self=, 2025-05-07T20:32:44.8570949Z T=1, 2025-05-07T20:32:44.8571203Z D=5120, 2025-05-07T20:32:44.8571464Z scale_ub=None, 2025-05-07T20:32:44.8571746Z contiguous=True, 2025-05-07T20:32:44.8572054Z compiled=True, 2025-05-07T20:32:44.8572325Z ) 2025-05-07T20:32:44.8572734Z self = 2025-05-07T20:32:44.8573365Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.8573707Z 2025-05-07T20:32:44.8573823Z @given( 2025-05-07T20:32:44.8574111Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8574520Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8574920Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8575352Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8575782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8576171Z ) 2025-05-07T20:32:44.8576646Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8577237Z def test_silu_mul_quant( 2025-05-07T20:32:44.8577563Z self, 2025-05-07T20:32:44.8577828Z T: int, 2025-05-07T20:32:44.8578135Z D: int, 2025-05-07T20:32:44.8578430Z scale_ub: Optional[float], 2025-05-07T20:32:44.8578790Z contiguous: 
bool, 2025-05-07T20:32:44.8579175Z compiled: bool, 2025-05-07T20:32:44.8579484Z ) -> None: 2025-05-07T20:32:44.8579775Z torch.manual_seed(2025) 2025-05-07T20:32:44.8580094Z 2025-05-07T20:32:44.8580456Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8580913Z 2025-05-07T20:32:44.8581166Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8581552Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8581986Z x = x_sign * x_clamp 2025-05-07T20:32:44.8582318Z x0 = x[:, :D] 2025-05-07T20:32:44.8582674Z x1 = x[:, D:] 2025-05-07T20:32:44.8582962Z 2025-05-07T20:32:44.8583231Z if contiguous: 2025-05-07T20:32:44.8583548Z x0 = x0.contiguous() 2025-05-07T20:32:44.8583932Z x1 = x1.contiguous() 2025-05-07T20:32:44.8584289Z 2025-05-07T20:32:44.8584565Z if scale_ub is not None: 2025-05-07T20:32:44.8584938Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8585440Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8585849Z ) 2025-05-07T20:32:44.8586111Z else: 2025-05-07T20:32:44.8586399Z scale_ub_tensor = None 2025-05-07T20:32:44.8586731Z 2025-05-07T20:32:44.8587036Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8587459Z op = silu_mul_quant 2025-05-07T20:32:44.8587799Z if compiled: 2025-05-07T20:32:44.8588121Z op = torch.compile(op) 2025-05-07T20:32:44.8588519Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8588891Z 2025-05-07T20:32:44.8589139Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.8589525Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.8589919Z 2025-05-07T20:32:44.8590229Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8590675Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.8591084Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.8591497Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.8591994Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.8592432Z 2025-05-07T20:32:44.8592692Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:44.8592968Z 2025-05-07T20:32:44.8593110Z moe/activation_test.py:126: 2025-05-07T20:32:44.8593528Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8594069Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:44.8594521Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.8595608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:44.8596641Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.8597396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.8598328Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8599259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:44.8600250Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.8601288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:44.8602337Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.8603370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:44.8604271Z return 
self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.8605160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:44.8606173Z fn() 2025-05-07T20:32:44.8606867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:44.8607743Z self.fn.run( 2025-05-07T20:32:44.8608392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.8609112Z kernel = self.compile( 2025-05-07T20:32:44.8609856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.8610857Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8611393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8611714Z 2025-05-07T20:32:44.8611987Z self = 2025-05-07T20:32:44.8613545Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.8615454Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5df65e13a0>} 2025-05-07T20:32:44.8617300Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.8618818Z context = 2025-05-07T20:32:44.8619234Z 2025-05-07T20:32:44.8619469Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.8620209Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8620865Z module_map=module_map) 2025-05-07T20:32:44.8621362Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8621847Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.8622219Z E ^ 2025-05-07T20:32:44.8622838Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8623451Z 2025-05-07T20:32:44.8624010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.8624819Z 2025-05-07T20:32:44.8624965Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8625526Z self=, 2025-05-07T20:32:44.8626047Z T=2048, 2025-05-07T20:32:44.8626313Z D=5120, 2025-05-07T20:32:44.8626578Z scale_ub=1200.0, 2025-05-07T20:32:44.8626876Z contiguous=True, 2025-05-07T20:32:44.8627194Z compiled=False, 2025-05-07T20:32:44.8627490Z ) 2025-05-07T20:32:44.8627915Z self = 2025-05-07T20:32:44.8628659Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.8629040Z 2025-05-07T20:32:44.8629166Z @given( 2025-05-07T20:32:44.8629488Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8629929Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8630347Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8630807Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8631243Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8631636Z ) 2025-05-07T20:32:44.8632124Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8632718Z def test_silu_mul_quant( 2025-05-07T20:32:44.8633037Z self, 2025-05-07T20:32:44.8633297Z T: int, 2025-05-07T20:32:44.8633555Z D: int, 2025-05-07T20:32:44.8633939Z scale_ub: Optional[float], 2025-05-07T20:32:44.8634306Z contiguous: bool, 2025-05-07T20:32:44.8634634Z compiled: bool, 2025-05-07T20:32:44.8634945Z ) -> None: 2025-05-07T20:32:44.8635237Z torch.manual_seed(2025) 2025-05-07T20:32:44.8635574Z 2025-05-07T20:32:44.8635962Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8636450Z 2025-05-07T20:32:44.8636726Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8637138Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8637644Z x = x_sign * x_clamp 2025-05-07T20:32:44.8637991Z x0 = x[:, :D] 2025-05-07T20:32:44.8638341Z x1 = x[:, D:] 2025-05-07T20:32:44.8638632Z 2025-05-07T20:32:44.8638898Z if contiguous: 2025-05-07T20:32:44.8639212Z x0 = x0.contiguous() 2025-05-07T20:32:44.8639580Z x1 = x1.contiguous() 2025-05-07T20:32:44.8639920Z 2025-05-07T20:32:44.8640191Z if scale_ub is not None: 2025-05-07T20:32:44.8640655Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8641107Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8641509Z ) 2025-05-07T20:32:44.8641773Z else: 2025-05-07T20:32:44.8642056Z scale_ub_tensor = None 2025-05-07T20:32:44.8642391Z 2025-05-07T20:32:44.8642691Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8643122Z op = silu_mul_quant 2025-05-07T20:32:44.8643470Z if compiled: 2025-05-07T20:32:44.8643809Z op = torch.compile(op) 2025-05-07T20:32:44.8644194Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8644570Z 2025-05-07T20:32:44.8644837Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.8645074Z 2025-05-07T20:32:44.8645216Z moe/activation_test.py:117: 2025-05-07T20:32:44.8645628Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8646085Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.8646471Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8647398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.8648418Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.8649123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.8650115Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8651006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.8651717Z kernel = self.compile( 2025-05-07T20:32:44.8652452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.8653351Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8653907Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8654230Z 2025-05-07T20:32:44.8654513Z self = 2025-05-07T20:32:44.8655939Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.8657331Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5df62902c0>} 2025-05-07T20:32:44.8658674Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.8659769Z context = 2025-05-07T20:32:44.8660116Z 2025-05-07T20:32:44.8660303Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.8660923Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8661480Z module_map=module_map) 2025-05-07T20:32:44.8661886Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8662294Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.8662629Z E ^ 2025-05-07T20:32:44.8663172Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8663734Z 2025-05-07T20:32:44.8664241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.8664882Z 2025-05-07T20:32:44.8664992Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8665518Z self=, 2025-05-07T20:32:44.8665919Z T=2048, 2025-05-07T20:32:44.8666116Z D=5120, 2025-05-07T20:32:44.8666313Z scale_ub=1200.0, 2025-05-07T20:32:44.8666537Z contiguous=True, 2025-05-07T20:32:44.8666758Z compiled=True, 2025-05-07T20:32:44.8666976Z ) 2025-05-07T20:32:44.8667299Z self = 2025-05-07T20:32:44.8667791Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.8668069Z 2025-05-07T20:32:44.8668150Z @given( 2025-05-07T20:32:44.8668383Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8668690Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8668999Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8669329Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8669651Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8669946Z ) 2025-05-07T20:32:44.8670194Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8670297Z def test_silu_mul_quant( 2025-05-07T20:32:44.8670378Z self, 2025-05-07T20:32:44.8670464Z T: int, 2025-05-07T20:32:44.8670544Z D: int, 2025-05-07T20:32:44.8670647Z scale_ub: Optional[float], 2025-05-07T20:32:44.8670743Z contiguous: bool, 2025-05-07T20:32:44.8670883Z compiled: bool, 2025-05-07T20:32:44.8670965Z ) -> None: 2025-05-07T20:32:44.8671071Z torch.manual_seed(2025) 2025-05-07T20:32:44.8671146Z 2025-05-07T20:32:44.8671318Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8671400Z 2025-05-07T20:32:44.8671497Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8671625Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8671724Z x = x_sign * x_clamp 2025-05-07T20:32:44.8671810Z x0 = x[:, :D] 2025-05-07T20:32:44.8671901Z x1 = x[:, D:] 2025-05-07T20:32:44.8671980Z 2025-05-07T20:32:44.8672067Z if contiguous: 2025-05-07T20:32:44.8672171Z x0 = x0.contiguous() 2025-05-07T20:32:44.8672265Z x1 = x1.contiguous() 2025-05-07T20:32:44.8672340Z 2025-05-07T20:32:44.8672440Z if scale_ub is not None: 2025-05-07T20:32:44.8672548Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8672685Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8672775Z ) 2025-05-07T20:32:44.8672857Z else: 2025-05-07T20:32:44.8672953Z scale_ub_tensor = None 2025-05-07T20:32:44.8673039Z 2025-05-07T20:32:44.8673169Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8673267Z op = silu_mul_quant 2025-05-07T20:32:44.8673355Z if compiled: 2025-05-07T20:32:44.8673457Z op = torch.compile(op) 2025-05-07T20:32:44.8673616Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8673693Z 2025-05-07T20:32:44.8673782Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.8673908Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.8673979Z 2025-05-07T20:32:44.8674114Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8674224Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.8674322Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.8674447Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.8674643Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.8674716Z 2025-05-07T20:32:44.8674826Z > 
y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:44.8674831Z 
2025-05-07T20:32:44.8674926Z moe/activation_test.py:126: 
2025-05-07T20:32:44.8675054Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:44.8675167Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:44.8675341Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:44.8675896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:44.8676005Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:44.8676363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:44.8676591Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:44.8676955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:44.8677211Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:44.8677610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:44.8677867Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:44.8678289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:44.8678454Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:44.8678793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:44.8678926Z     fn()
2025-05-07T20:32:44.8679319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:44.8679410Z     self.fn.run(
2025-05-07T20:32:44.8679744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:44.8679836Z     kernel = self.compile(
2025-05-07T20:32:44.8680220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:44.8680398Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:44.8680529Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:44.8680534Z 
2025-05-07T20:32:44.8680739Z self = <...>
2025-05-07T20:32:44.8681510Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:44.8682022Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f5df6291440>}
2025-05-07T20:32:44.8682805Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:44.8683003Z context = <...>
2025-05-07T20:32:44.8683007Z 
2025-05-07T20:32:44.8683170Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:44.8683436Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:44.8683542Z                            module_map=module_map)
2025-05-07T20:32:44.8683708Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8683858Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:44.8683936Z E   ^
2025-05-07T20:32:44.8684289Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8684293Z 
2025-05-07T20:32:44.8684710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.8684715Z 
2025-05-07T20:32:44.8684857Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.8685083Z     self=<...>,
2025-05-07T20:32:44.8685161Z     T=16384,
2025-05-07T20:32:44.8685236Z     D=7168,
2025-05-07T20:32:44.8685325Z     scale_ub=1200.0,
2025-05-07T20:32:44.8685408Z     contiguous=False,
2025-05-07T20:32:44.8685491Z     compiled=False,
2025-05-07T20:32:44.8685570Z )
2025-05-07T20:32:44.8685791Z self = <...>
2025-05-07T20:32:44.8685973Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:44.8685987Z 
2025-05-07T20:32:44.8686065Z     @given(
2025-05-07T20:32:44.8686182Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:44.8686288Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:44.8686402Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:44.8686521Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:44.8686641Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:44.8686712Z     )
2025-05-07T20:32:44.8686956Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:44.8687055Z     def test_silu_mul_quant(
2025-05-07T20:32:44.8687132Z         self,
2025-05-07T20:32:44.8687209Z         T: int,
2025-05-07T20:32:44.8687291Z         D: int,
2025-05-07T20:32:44.8687435Z         scale_ub: Optional[float],
2025-05-07T20:32:44.8687530Z         contiguous: bool,
2025-05-07T20:32:44.8687706Z         compiled: bool,
2025-05-07T20:32:44.8687787Z     ) -> None:
2025-05-07T20:32:44.8687892Z         torch.manual_seed(2025)
2025-05-07T20:32:44.8687969Z 
2025-05-07T20:32:44.8688144Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:44.8688228Z 
2025-05-07T20:32:44.8688325Z         x_sign = torch.sign(x)
2025-05-07T20:32:44.8688457Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:44.8688565Z         x = x_sign * x_clamp
2025-05-07T20:32:44.8688654Z         x0 = x[:, :D]
2025-05-07T20:32:44.8688737Z         x1 = x[:, D:]
2025-05-07T20:32:44.8688819Z 
2025-05-07T20:32:44.8688906Z         if contiguous:
2025-05-07T20:32:44.8689010Z             x0 = x0.contiguous()
2025-05-07T20:32:44.8689102Z             x1 = x1.contiguous()
2025-05-07T20:32:44.8689180Z 
2025-05-07T20:32:44.8689280Z         if scale_ub is not None:
2025-05-07T20:32:44.8689393Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:44.8689533Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:44.8689616Z             )
2025-05-07T20:32:44.8689695Z         else:
2025-05-07T20:32:44.8689794Z             scale_ub_tensor = None
2025-05-07T20:32:44.8689881Z 
2025-05-07T20:32:44.8690012Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:44.8690105Z             op = silu_mul_quant
2025-05-07T20:32:44.8690252Z             if compiled:
2025-05-07T20:32:44.8690359Z                 op = torch.compile(op)
2025-05-07T20:32:44.8690477Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:44.8690554Z 
2025-05-07T20:32:44.8690647Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:44.8690652Z 
2025-05-07T20:32:44.8690759Z moe/activation_test.py:117: 
2025-05-07T20:32:44.8690890Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:44.8690996Z moe/activation_test.py:115: in fn
2025-05-07T20:32:44.8691108Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:44.8691676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:44.8691775Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:44.8692140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:44.8692403Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:44.8692752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:44.8692849Z     kernel = self.compile(
2025-05-07T20:32:44.8693229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:44.8693411Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:44.8693542Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:44.8693550Z 
2025-05-07T20:32:44.8693761Z self = <...>
2025-05-07T20:32:44.8694532Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:44.8695039Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f5df50ba980>}
2025-05-07T20:32:44.8695790Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:44.8695980Z context = <...>
2025-05-07T20:32:44.8696030Z 
2025-05-07T20:32:44.8696204Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:44.8696466Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:44.8696577Z                            module_map=module_map)
2025-05-07T20:32:44.8696744Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8696847Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:44.8696932Z E   ^
2025-05-07T20:32:44.8697290Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8697295Z 
2025-05-07T20:32:44.8697709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
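Both tracebacks above (and every retry below) die in Triton's make_ir with the same root cause: the kernels cast to fp8e4nv (FP8 E4M3), which Triton only supports on NVIDIA GPUs with compute capability 8.9 or newer, while the A10G on this g5 runner reports (8, 6) and only offers fp8e4b15 and fp8e5. A minimal sketch of a capability guard that would skip these cases on such hardware; the helper name and decorator placement here are illustrative, not FBGEMM's actual test code:

    # Hypothetical guard, assuming only that torch is importable; not taken
    # from the FBGEMM test suite.
    import unittest

    import torch


    def _supports_fp8_e4m3() -> bool:
        # Triton's fp8e4nv maps to CUDA FP8 E4M3, which needs SM 8.9+
        # (Ada/Hopper). The A10G on this runner reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipIf(not _supports_fp8_e4m3(), "FP8 E4M3 requires SM 8.9+")
    class ActivationTests(unittest.TestCase):
        ...  # test_silu_mul_quant would live here

With such a guard the whole class is reported as skipped on pre-SM89 runners instead of failing once per Hypothesis example.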
2025-05-07T20:32:44.8697714Z 
2025-05-07T20:32:44.8697826Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.8698052Z     self=<...>,
2025-05-07T20:32:44.8698142Z     T=1,
2025-05-07T20:32:44.8698223Z     D=7168,
2025-05-07T20:32:44.8698308Z     scale_ub=None,
2025-05-07T20:32:44.8698405Z     contiguous=True,
2025-05-07T20:32:44.8698493Z     compiled=True,
2025-05-07T20:32:44.8698568Z )
2025-05-07T20:32:44.8704665Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:44.8704670Z 
2025-05-07T20:32:44.8704773Z moe/activation_test.py:126: 
2025-05-07T20:32:44.8722429Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8722539Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:44.8722622Z E   ^
2025-05-07T20:32:44.8722988Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8722993Z 
2025-05-07T20:32:44.8723411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.8723418Z 
2025-05-07T20:32:44.8723535Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.8723768Z     self=<...>,
2025-05-07T20:32:44.8723851Z     T=4096,
2025-05-07T20:32:44.8723939Z     D=5120,
2025-05-07T20:32:44.8724027Z     scale_ub=None,
2025-05-07T20:32:44.8724119Z     contiguous=False,
2025-05-07T20:32:44.8724215Z     compiled=False,
2025-05-07T20:32:44.8724295Z )
2025-05-07T20:32:44.8729659Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:44.8729664Z 
2025-05-07T20:32:44.8729768Z moe/activation_test.py:117: 
2025-05-07T20:32:44.8735866Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8735979Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:44.8736062Z E   ^
2025-05-07T20:32:44.8736426Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8736434Z 
2025-05-07T20:32:44.8736857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.8736862Z 
2025-05-07T20:32:44.8736972Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.8737209Z     self=<...>,
2025-05-07T20:32:44.8737295Z     T=4096,
2025-05-07T20:32:44.8737382Z     D=7168,
2025-05-07T20:32:44.8737476Z     scale_ub=None,
2025-05-07T20:32:44.8737569Z     contiguous=False,
2025-05-07T20:32:44.8737660Z     compiled=False,
2025-05-07T20:32:44.8737747Z )
2025-05-07T20:32:44.8742905Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:44.8742950Z 
2025-05-07T20:32:44.8743063Z moe/activation_test.py:117: 
2025-05-07T20:32:44.8749041Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8749146Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:44.8749229Z E   ^
2025-05-07T20:32:44.8749596Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8749600Z 
2025-05-07T20:32:44.8750062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.8750067Z 
2025-05-07T20:32:44.8750187Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.8750415Z     self=<...>,
2025-05-07T20:32:44.8750499Z     T=128,
2025-05-07T20:32:44.8750590Z     D=7168,
2025-05-07T20:32:44.8750681Z     scale_ub=None,
2025-05-07T20:32:44.8750777Z     contiguous=False,
2025-05-07T20:32:44.8750873Z     compiled=True,
2025-05-07T20:32:44.8750996Z )
2025-05-07T20:32:44.8757176Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:44.8757225Z 
2025-05-07T20:32:44.8757332Z moe/activation_test.py:126: 
2025-05-07T20:32:44.8766275Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8766382Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:44.8766464Z E   ^
2025-05-07T20:32:44.8766872Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8766877Z 
2025-05-07T20:32:44.8767293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
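For orientation, the pair under test computes the same thing two ways: silu_mul_quant fuses SiLU(x0) * x1 with row-wise FP8 quantization, and the test's ref_fn reproduces it as an fp32 SiLU-mul followed by triton_quantize_fp8_row, then compares after dequantizing with y_scale[:, None]. A rough plain-PyTorch sketch of that row-wise recipe; the amax/448 scaling and the clamp details are assumptions for illustration, not the exact kernel logic:

    # Reference sketch only; runs on any device with float8 support in
    # recent PyTorch. Not FBGEMM's triton_quantize_fp8_row implementation.
    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max


    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32, exactly as the test's ref_fn does.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # One scale per row: row amax (optionally capped by scale_ub)
        # divided by the fp8 e4m3 max, floored to avoid division by zero.
        row_amax = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_amax = torch.clamp(row_amax, max=scale_ub.item())
        scale = (row_amax / FP8_E4M3_MAX).clamp(min=1e-12)
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        # Dequantize with y_fp8.to(torch.float32) * scale[:, None].
        return y_fp8, scale

The final cast to torch.float8_e4m3fn is the reference counterpart of the fp8e4nv conversion that the Triton kernels attempt, which is exactly the step this GPU cannot compile.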
2025-05-07T20:32:44.8767298Z 
2025-05-07T20:32:44.8767414Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.8767694Z     self=<...>,
2025-05-07T20:32:44.8767780Z     T=128,
2025-05-07T20:32:44.8767869Z     D=7168,
2025-05-07T20:32:44.8768001Z     scale_ub=None,
2025-05-07T20:32:44.8768092Z     contiguous=False,
2025-05-07T20:32:44.8768187Z     compiled=False,
2025-05-07T20:32:44.8768266Z )
2025-05-07T20:32:44.8773216Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:44.8773224Z 
2025-05-07T20:32:44.8773324Z moe/activation_test.py:117: 
2025-05-07T20:32:44.8779342Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8779452Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:44.8779534Z E   ^
2025-05-07T20:32:44.8779889Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8779894Z 
2025-05-07T20:32:44.8780312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.8780361Z 
2025-05-07T20:32:44.8780469Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.8780697Z     self=<...>,
2025-05-07T20:32:44.8780777Z     T=4096,
2025-05-07T20:32:44.8780855Z     D=5120,
2025-05-07T20:32:44.8780948Z     scale_ub=1200.0,
2025-05-07T20:32:44.8781034Z     contiguous=True,
2025-05-07T20:32:44.8781123Z     compiled=False,
2025-05-07T20:32:44.8781207Z )
2025-05-07T20:32:44.8786220Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:44.8786225Z 
2025-05-07T20:32:44.8786330Z moe/activation_test.py:117: 
2025-05-07T20:32:44.8792256Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8792362Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:44.8792485Z E   ^
2025-05-07T20:32:44.8792848Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8792853Z 
2025-05-07T20:32:44.8793264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.8793269Z 
2025-05-07T20:32:44.8793382Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.8793666Z     self=<...>,
2025-05-07T20:32:44.8793750Z     T=1,
2025-05-07T20:32:44.8793841Z     D=5120,
2025-05-07T20:32:44.8793931Z     scale_ub=None,
2025-05-07T20:32:44.8794021Z     contiguous=True,
2025-05-07T20:32:44.8794114Z     compiled=True,
2025-05-07T20:32:44.8794194Z )
2025-05-07T20:32:44.8800177Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:44.8800182Z 
2025-05-07T20:32:44.8800284Z moe/activation_test.py:126: 
2025-05-07T20:32:44.8809660Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8809769Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:44.8809849Z E   ^
2025-05-07T20:32:44.8810206Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8810216Z 
2025-05-07T20:32:44.8810633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
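Note the pattern across examples: with compiled=False the failure surfaces directly inside silu_mul_quant (_fbgemm_silu_mul_quant), while with compiled=True fn() returns and the reference path fails instead (_kernel_quantize_fp8_row); both kernels die before any autotuning config can run. The failure is also reproducible without FBGEMM: any Triton kernel that touches tl.float8e4nv should raise the same CompilationError on this GPU. An untested minimal sketch (kernel and tensor names invented here, and the exact float8 pointer plumbing may vary by Triton version):

    # Standalone reproducer sketch; assumes a recent Triton and a PyTorch
    # build with float8 dtypes. On SM 8.6 GPUs such as the A10G this is
    # expected to raise "type fp8e4nv not supported in this architecture".
    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _fp8_cast_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # The cast below is the operation that needs fp8e4nv hardware support.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


    x = torch.randn(4096, device="cuda")
    y = torch.empty(4096, device="cuda", dtype=torch.float8_e4m3fn)
    _fp8_cast_kernel[(triton.cdiv(4096, 1024),)](x, y, 4096, BLOCK=1024)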
2025-05-07T20:32:44.8810695Z 
2025-05-07T20:32:44.8810804Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.8811034Z     self=<...>,
2025-05-07T20:32:44.8811116Z     T=2048,
2025-05-07T20:32:44.8811198Z     D=5120,
2025-05-07T20:32:44.8811293Z     scale_ub=None,
2025-05-07T20:32:44.8811380Z     contiguous=True,
2025-05-07T20:32:44.8811467Z     compiled=True,
2025-05-07T20:32:44.8811552Z )
2025-05-07T20:32:44.8817608Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:44.8817613Z 
2025-05-07T20:32:44.8817714Z moe/activation_test.py:126: 
2025-05-07T20:32:44.8826649Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8826756Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:44.8826837Z E   ^
2025-05-07T20:32:44.8827193Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8827198Z 
2025-05-07T20:32:44.8827659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.8827664Z 
2025-05-07T20:32:44.8827772Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.8828003Z     self=<...>,
2025-05-07T20:32:44.8828084Z     T=128,
2025-05-07T20:32:44.8828164Z     D=5120,
2025-05-07T20:32:44.8828255Z     scale_ub=None,
2025-05-07T20:32:44.8828342Z     contiguous=True,
2025-05-07T20:32:44.8828426Z     compiled=True,
2025-05-07T20:32:44.8828510Z )
2025-05-07T20:32:44.8834487Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:44.8834500Z 
2025-05-07T20:32:44.8834637Z moe/activation_test.py:126: 
2025-05-07T20:32:44.8843456Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8843559Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:44.8843645Z E   ^
2025-05-07T20:32:44.8844042Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8844047Z 
2025-05-07T20:32:44.8844460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.8841744Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ...<locals>.<lambda> at 0x7f5d3fc91e40>}
2025-05-07T20:32:44.8842490Z module_map = {'triton.language.extra.libdevice': <module 'triton.language.extra.libdevice' ...>}
2025-05-07T20:32:44.8842683Z context = <...>
2025-05-07T20:32:44.8842688Z 
2025-05-07T20:32:44.8842862Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:44.8843175Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:44.8843294Z                           module_map=module_map)
2025-05-07T20:32:44.8843456Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8843559Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:44.8843645Z E       ^
2025-05-07T20:32:44.8844042Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8844047Z 
2025-05-07T20:32:44.8844460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
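Every failure in this run has the same root cause: both kernels ask Triton for the fp8e4nv element type (PyTorch's float8_e4m3fn), and this runner's GPU cannot lower it. A g5.4xlarge carries an NVIDIA A10G at compute capability 8.6, while Triton generally lowers fp8e4nv only on compute capability 8.9 and newer (Ada/Hopper). A minimal sketch of a guard a test file like this could use, assuming the 8.9 threshold; the helper name is illustrative, not FBGEMM or test-suite API:

    import unittest
    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Assumption: Triton lowers fp8e4nv only on NVIDIA GPUs with
        # compute capability >= 8.9; the A10G in this job reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Illustrative usage on a test like the one above:
    # @unittest.skipIf(not gpu_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    # def test_silu_mul_quant(self, ...) -> None: ...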
Hypothesis goes on to retry the test with further examples. Every retry re-prints the identical test body shown above, fails while compiling the same Triton code, and ends in the same CompilationError at triton/compiler/compiler.py:100; only the drawn parameters and the failing call site vary. The next four retries:

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> fails in fn() at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True) -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
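The two failing call sites exercise the same contract from different directions: fn() runs the fused _fbgemm_silu_mul_quant kernel, while ref_fn() materializes y = x0 * sigmoid(x0) * x1 in fp32 and row-quantizes it with triton_quantize_fp8_row. Judging from how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]), the quantization produces one fp32 scale per row. A pure-PyTorch sketch of that row-wise scheme; the 448.0 constant is the float8_e4m3fn maximum, and treating scale_ub as a cap on the per-row maximum is an assumption about triton_quantize_fp8_row, not something this log confirms:

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_sketch(
        y: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
        eps: float = 1e-12,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, chosen so the row's max magnitude maps to the
        # largest representable float8_e4m3fn value (~448).
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=eps) / 448.0
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        # Dequantize as the test does: y_fp8.to(torch.float32) * scale[:, None]
        return y_fp8, scale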
at 0x7f5d3f2f9f80>} 2025-05-07T20:32:44.8920091Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.8920289Z context = 2025-05-07T20:32:44.8920294Z 2025-05-07T20:32:44.8920459Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.8920725Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8920848Z module_map=module_map) 2025-05-07T20:32:44.8921013Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8921117Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.8921201Z E ^ 2025-05-07T20:32:44.8921556Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8921561Z 2025-05-07T20:32:44.8922029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.8922034Z 2025-05-07T20:32:44.8922145Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8922368Z self=, 2025-05-07T20:32:44.8922452Z T=1, 2025-05-07T20:32:44.8922535Z D=5120, 2025-05-07T20:32:44.8922621Z scale_ub=None, 2025-05-07T20:32:44.8922717Z contiguous=True, 2025-05-07T20:32:44.8922804Z compiled=False, 2025-05-07T20:32:44.8922946Z ) 2025-05-07T20:32:44.8923173Z self = 2025-05-07T20:32:44.8923340Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.8923345Z 2025-05-07T20:32:44.8923433Z @given( 2025-05-07T20:32:44.8923555Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8923660Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8923830Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8923952Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8924067Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8924149Z ) 2025-05-07T20:32:44.8924396Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8924500Z def test_silu_mul_quant( 2025-05-07T20:32:44.8924580Z self, 2025-05-07T20:32:44.8924663Z T: int, 2025-05-07T20:32:44.8924751Z D: int, 2025-05-07T20:32:44.8924857Z scale_ub: Optional[float], 2025-05-07T20:32:44.8924948Z contiguous: bool, 2025-05-07T20:32:44.8925044Z compiled: bool, 2025-05-07T20:32:44.8925126Z ) -> None: 2025-05-07T20:32:44.8925224Z torch.manual_seed(2025) 2025-05-07T20:32:44.8925306Z 2025-05-07T20:32:44.8925476Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8925553Z 2025-05-07T20:32:44.8925660Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8925787Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8925885Z x = x_sign * x_clamp 2025-05-07T20:32:44.8925968Z x0 = x[:, :D] 2025-05-07T20:32:44.8926053Z x1 = x[:, D:] 2025-05-07T20:32:44.8926133Z 2025-05-07T20:32:44.8926220Z if contiguous: 2025-05-07T20:32:44.8926314Z x0 = x0.contiguous() 2025-05-07T20:32:44.8926413Z x1 = x1.contiguous() 2025-05-07T20:32:44.8926534Z 2025-05-07T20:32:44.8926631Z if scale_ub is not None: 2025-05-07T20:32:44.8926744Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8926882Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8926961Z ) 2025-05-07T20:32:44.8927046Z else: 2025-05-07T20:32:44.8927144Z scale_ub_tensor = None 2025-05-07T20:32:44.8927220Z 2025-05-07T20:32:44.8927364Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8927461Z op = silu_mul_quant 2025-05-07T20:32:44.8927623Z if compiled: 2025-05-07T20:32:44.8927723Z 
op = torch.compile(op) 2025-05-07T20:32:44.8927829Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8927908Z 2025-05-07T20:32:44.8928008Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.8928014Z 2025-05-07T20:32:44.8928127Z moe/activation_test.py:117: 2025-05-07T20:32:44.8928286Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8928396Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.8928502Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8929004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.8929103Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.8929508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.8929731Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8930072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.8930176Z kernel = self.compile( 2025-05-07T20:32:44.8930556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.8930738Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8930946Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8930951Z 2025-05-07T20:32:44.8931156Z self = 2025-05-07T20:32:44.8931981Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.8932483Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3f2fb9c0>} 2025-05-07T20:32:44.8933230Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.8933426Z context = 2025-05-07T20:32:44.8933431Z 2025-05-07T20:32:44.8933592Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.8933859Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8933968Z module_map=module_map) 2025-05-07T20:32:44.8934137Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8934240Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.8934321Z E ^ 2025-05-07T20:32:44.8934682Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8934686Z 2025-05-07T20:32:44.8935100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.8935146Z 2025-05-07T20:32:44.8935257Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8935479Z self=, 2025-05-07T20:32:44.8935556Z T=128, 2025-05-07T20:32:44.8935639Z D=5120, 2025-05-07T20:32:44.8935727Z scale_ub=None, 2025-05-07T20:32:44.8935814Z contiguous=False, 2025-05-07T20:32:44.8935904Z compiled=True, 2025-05-07T20:32:44.8935979Z ) 2025-05-07T20:32:44.8936196Z self = 2025-05-07T20:32:44.8936375Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.8936380Z 2025-05-07T20:32:44.8936459Z @given( 2025-05-07T20:32:44.8936583Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8936683Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8936798Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8936920Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8937035Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8937114Z ) 2025-05-07T20:32:44.8937360Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8937455Z def test_silu_mul_quant( 2025-05-07T20:32:44.8937534Z self, 2025-05-07T20:32:44.8937617Z T: int, 2025-05-07T20:32:44.8937696Z D: int, 2025-05-07T20:32:44.8937796Z scale_ub: Optional[float], 2025-05-07T20:32:44.8937934Z contiguous: bool, 2025-05-07T20:32:44.8938024Z compiled: bool, 2025-05-07T20:32:44.8938107Z ) -> None: 2025-05-07T20:32:44.8938203Z torch.manual_seed(2025) 2025-05-07T20:32:44.8938279Z 2025-05-07T20:32:44.8938453Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8938527Z 2025-05-07T20:32:44.8938618Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8938746Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8938839Z x = x_sign * x_clamp 2025-05-07T20:32:44.8938963Z x0 = x[:, :D] 2025-05-07T20:32:44.8939052Z x1 = x[:, D:] 2025-05-07T20:32:44.8939124Z 2025-05-07T20:32:44.8939209Z if contiguous: 2025-05-07T20:32:44.8939305Z x0 = x0.contiguous() 2025-05-07T20:32:44.8939393Z x1 = x1.contiguous() 2025-05-07T20:32:44.8939470Z 2025-05-07T20:32:44.8939562Z if scale_ub is not None: 2025-05-07T20:32:44.8939668Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8939850Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8939930Z ) 2025-05-07T20:32:44.8940006Z else: 2025-05-07T20:32:44.8940106Z scale_ub_tensor = None 2025-05-07T20:32:44.8940181Z 2025-05-07T20:32:44.8940310Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8940406Z op = silu_mul_quant 2025-05-07T20:32:44.8940491Z if compiled: 2025-05-07T20:32:44.8940593Z op = torch.compile(op) 2025-05-07T20:32:44.8940706Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8940780Z 2025-05-07T20:32:44.8940870Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.8940879Z 2025-05-07T20:32:44.8940978Z moe/activation_test.py:117: 2025-05-07T20:32:44.8941112Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8941218Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.8941319Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8941685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.8941782Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.8942273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.8942377Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.8942731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.8942999Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8943341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.8943437Z kernel = self.compile( 2025-05-07T20:32:44.8943819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.8943996Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8944124Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8944128Z 2025-05-07T20:32:44.8944338Z self = 2025-05-07T20:32:44.8945108Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.8945614Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3f0a1120>} 2025-05-07T20:32:44.8946403Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.8946595Z context = 2025-05-07T20:32:44.8946599Z 2025-05-07T20:32:44.8946766Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.8947027Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8947134Z module_map=module_map) 2025-05-07T20:32:44.8947302Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8947444Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.8947530Z E ^ 2025-05-07T20:32:44.8947881Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8947886Z 2025-05-07T20:32:44.8948331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.8948340Z 2025-05-07T20:32:44.8948501Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8948723Z self=, 2025-05-07T20:32:44.8948804Z T=128, 2025-05-07T20:32:44.8948885Z D=7168, 2025-05-07T20:32:44.8948970Z scale_ub=1200.0, 2025-05-07T20:32:44.8949059Z contiguous=False, 2025-05-07T20:32:44.8949147Z compiled=False, 2025-05-07T20:32:44.8949221Z ) 2025-05-07T20:32:44.8949443Z self = 2025-05-07T20:32:44.8949620Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.8949625Z 2025-05-07T20:32:44.8949704Z @given( 2025-05-07T20:32:44.8949828Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8949927Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8950046Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8950164Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8950278Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8950357Z ) 2025-05-07T20:32:44.8950600Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8950693Z def test_silu_mul_quant( 2025-05-07T20:32:44.8950774Z self, 2025-05-07T20:32:44.8950850Z T: int, 2025-05-07T20:32:44.8950927Z D: int, 2025-05-07T20:32:44.8951029Z scale_ub: Optional[float], 2025-05-07T20:32:44.8951164Z contiguous: bool, 2025-05-07T20:32:44.8951255Z compiled: bool, 2025-05-07T20:32:44.8951340Z ) -> None: 2025-05-07T20:32:44.8951435Z torch.manual_seed(2025) 2025-05-07T20:32:44.8951517Z 2025-05-07T20:32:44.8951686Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8951763Z 2025-05-07T20:32:44.8951856Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8951979Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8952072Z x = x_sign * x_clamp 2025-05-07T20:32:44.8952158Z x0 = x[:, :D] 2025-05-07T20:32:44.8952239Z x1 = x[:, D:] 2025-05-07T20:32:44.8952318Z 2025-05-07T20:32:44.8952403Z if contiguous: 2025-05-07T20:32:44.8952494Z x0 = x0.contiguous() 2025-05-07T20:32:44.8952589Z x1 = x1.contiguous() 2025-05-07T20:32:44.8952660Z 2025-05-07T20:32:44.8952749Z if scale_ub is not None: 2025-05-07T20:32:44.8952862Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8953000Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8953085Z ) 2025-05-07T20:32:44.8953162Z else: 2025-05-07T20:32:44.8953257Z scale_ub_tensor = None 2025-05-07T20:32:44.8953339Z 2025-05-07T20:32:44.8953466Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8953556Z op = silu_mul_quant 2025-05-07T20:32:44.8953646Z if compiled: 2025-05-07T20:32:44.8953816Z op = torch.compile(op) 2025-05-07T20:32:44.8953923Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8954000Z 2025-05-07T20:32:44.8954093Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.8954097Z 2025-05-07T20:32:44.8954192Z moe/activation_test.py:117: 2025-05-07T20:32:44.8954327Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8954428Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.8954536Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8955073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.8955171Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.8955531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.8955754Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8956131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.8956231Z kernel = self.compile( 2025-05-07T20:32:44.8956609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.8956784Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8956914Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8956921Z 2025-05-07T20:32:44.8957129Z self = 2025-05-07T20:32:44.8957902Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.8958453Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3f0a0360>} 2025-05-07T20:32:44.8959203Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.8959393Z context = 2025-05-07T20:32:44.8959438Z 2025-05-07T20:32:44.8959606Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.8959873Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8959982Z module_map=module_map) 2025-05-07T20:32:44.8960150Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8960251Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.8960331Z E ^ 2025-05-07T20:32:44.8960696Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8960700Z 2025-05-07T20:32:44.8961111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.8961116Z 2025-05-07T20:32:44.8961225Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8961446Z self=, 2025-05-07T20:32:44.8961531Z T=128, 2025-05-07T20:32:44.8961615Z D=5120, 2025-05-07T20:32:44.8961703Z scale_ub=None, 2025-05-07T20:32:44.8961788Z contiguous=False, 2025-05-07T20:32:44.8961878Z compiled=False, 2025-05-07T20:32:44.8961953Z ) 2025-05-07T20:32:44.8962168Z self = 2025-05-07T20:32:44.8962344Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.8962392Z 2025-05-07T20:32:44.8962478Z @given( 2025-05-07T20:32:44.8962601Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8962702Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8962817Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8962941Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8963055Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8963130Z ) 2025-05-07T20:32:44.8963377Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8963514Z def test_silu_mul_quant( 2025-05-07T20:32:44.8963597Z self, 2025-05-07T20:32:44.8963675Z T: int, 2025-05-07T20:32:44.8963748Z D: int, 2025-05-07T20:32:44.8963850Z scale_ub: Optional[float], 2025-05-07T20:32:44.8963941Z contiguous: bool, 2025-05-07T20:32:44.8964028Z compiled: bool, 2025-05-07T20:32:44.8964112Z ) -> None: 2025-05-07T20:32:44.8964210Z torch.manual_seed(2025) 2025-05-07T20:32:44.8964328Z 2025-05-07T20:32:44.8964504Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8964580Z 2025-05-07T20:32:44.8964672Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8964801Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8964890Z x = x_sign * x_clamp 2025-05-07T20:32:44.8964974Z x0 = x[:, :D] 2025-05-07T20:32:44.8965063Z x1 = x[:, D:] 2025-05-07T20:32:44.8965137Z 2025-05-07T20:32:44.8965228Z if contiguous: 2025-05-07T20:32:44.8965321Z x0 = x0.contiguous() 2025-05-07T20:32:44.8965413Z x1 = x1.contiguous() 2025-05-07T20:32:44.8965492Z 2025-05-07T20:32:44.8965583Z if scale_ub is not None: 2025-05-07T20:32:44.8965693Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8965834Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8965914Z ) 2025-05-07T20:32:44.8965997Z else: 2025-05-07T20:32:44.8966097Z scale_ub_tensor = None 2025-05-07T20:32:44.8966171Z 2025-05-07T20:32:44.8966302Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8966397Z op = silu_mul_quant 2025-05-07T20:32:44.8966484Z if compiled: 2025-05-07T20:32:44.8966590Z op = torch.compile(op) 2025-05-07T20:32:44.8966696Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8966814Z 2025-05-07T20:32:44.8966912Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.8966919Z 2025-05-07T20:32:44.8967016Z moe/activation_test.py:117: 2025-05-07T20:32:44.8967146Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8967252Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.8967351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8967898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.8968008Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.8968360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.8968582Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8968919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.8969020Z kernel = self.compile( 2025-05-07T20:32:44.8969406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.8969579Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8969710Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8969714Z 2025-05-07T20:32:44.8969965Z self = 2025-05-07T20:32:44.8970738Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.8971242Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3ec28720>} 2025-05-07T20:32:44.8971987Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.8972219Z context = 2025-05-07T20:32:44.8972224Z 2025-05-07T20:32:44.8972389Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.8972691Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8972805Z module_map=module_map) 2025-05-07T20:32:44.8972964Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8973066Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.8973145Z E ^ 2025-05-07T20:32:44.8973495Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8973503Z 2025-05-07T20:32:44.8973918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.8973926Z 2025-05-07T20:32:44.8974030Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8974254Z self=, 2025-05-07T20:32:44.8974334Z T=128, 2025-05-07T20:32:44.8974413Z D=5120, 2025-05-07T20:32:44.8974501Z scale_ub=1200.0, 2025-05-07T20:32:44.8974589Z contiguous=True, 2025-05-07T20:32:44.8974675Z compiled=False, 2025-05-07T20:32:44.8974756Z ) 2025-05-07T20:32:44.8974971Z self = 2025-05-07T20:32:44.8975138Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.8975143Z 2025-05-07T20:32:44.8975222Z @given( 2025-05-07T20:32:44.8975341Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8975490Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8975606Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8975723Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8975841Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8975916Z ) 2025-05-07T20:32:44.8976158Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8976257Z def test_silu_mul_quant( 2025-05-07T20:32:44.8976336Z self, 2025-05-07T20:32:44.8976415Z T: int, 2025-05-07T20:32:44.8976495Z D: int, 2025-05-07T20:32:44.8976592Z scale_ub: Optional[float], 2025-05-07T20:32:44.8976681Z contiguous: bool, 2025-05-07T20:32:44.8976772Z compiled: bool, 2025-05-07T20:32:44.8976848Z ) -> None: 2025-05-07T20:32:44.8976945Z torch.manual_seed(2025) 2025-05-07T20:32:44.8977020Z 2025-05-07T20:32:44.8977188Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8977280Z 2025-05-07T20:32:44.8977375Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8977500Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8977597Z x = x_sign * x_clamp 2025-05-07T20:32:44.8977679Z x0 = x[:, :D] 2025-05-07T20:32:44.8977760Z x1 = x[:, D:] 2025-05-07T20:32:44.8977837Z 2025-05-07T20:32:44.8977923Z if contiguous: 2025-05-07T20:32:44.8978014Z x0 = x0.contiguous() 2025-05-07T20:32:44.8978155Z x1 = x1.contiguous() 2025-05-07T20:32:44.8978232Z 2025-05-07T20:32:44.8978325Z if scale_ub is not None: 2025-05-07T20:32:44.8978436Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8978570Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8978649Z ) 2025-05-07T20:32:44.8978725Z else: 2025-05-07T20:32:44.8978818Z scale_ub_tensor = None 2025-05-07T20:32:44.8978897Z 2025-05-07T20:32:44.8979029Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8979168Z op = silu_mul_quant 2025-05-07T20:32:44.8983196Z if compiled: 2025-05-07T20:32:44.8983314Z op = torch.compile(op) 2025-05-07T20:32:44.8983427Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8983501Z 2025-05-07T20:32:44.8983594Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.8983599Z 2025-05-07T20:32:44.8983706Z moe/activation_test.py:117: 2025-05-07T20:32:44.8983913Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8984016Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.8984117Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8984629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.8984736Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.8985099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.8985325Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8985670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.8985765Z kernel = self.compile( 2025-05-07T20:32:44.8986151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.8986333Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8986461Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8986465Z 2025-05-07T20:32:44.8986678Z self = 2025-05-07T20:32:44.8987455Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.8988027Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3ec298a0>} 2025-05-07T20:32:44.8988782Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.8988975Z context = 2025-05-07T20:32:44.8988980Z 2025-05-07T20:32:44.8989146Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.8989406Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8989513Z module_map=module_map) 2025-05-07T20:32:44.8989679Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8989780Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.8989864Z E ^ 2025-05-07T20:32:44.8990218Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8990222Z 2025-05-07T20:32:44.8990631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.8990676Z 2025-05-07T20:32:44.8990792Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8991015Z self=, 2025-05-07T20:32:44.8991099Z T=1, 2025-05-07T20:32:44.8991179Z D=7168, 2025-05-07T20:32:44.8991262Z scale_ub=1200.0, 2025-05-07T20:32:44.8991351Z contiguous=True, 2025-05-07T20:32:44.8991434Z compiled=True, 2025-05-07T20:32:44.8991506Z ) 2025-05-07T20:32:44.8991728Z self = 2025-05-07T20:32:44.8991933Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.8991938Z 2025-05-07T20:32:44.8992015Z @given( 2025-05-07T20:32:44.8992143Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8992243Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8992362Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8992481Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8992629Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8992713Z ) 2025-05-07T20:32:44.8992953Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8993051Z def test_silu_mul_quant( 2025-05-07T20:32:44.8993137Z self, 2025-05-07T20:32:44.8993214Z T: int, 2025-05-07T20:32:44.8993290Z D: int, 2025-05-07T20:32:44.8993390Z scale_ub: Optional[float], 2025-05-07T20:32:44.8993482Z contiguous: bool, 2025-05-07T20:32:44.8993567Z compiled: bool, 2025-05-07T20:32:44.8993654Z ) -> None: 2025-05-07T20:32:44.8993749Z torch.manual_seed(2025) 2025-05-07T20:32:44.8993826Z 2025-05-07T20:32:44.8993991Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8994071Z 2025-05-07T20:32:44.8994161Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8994286Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8994377Z x = x_sign * x_clamp 2025-05-07T20:32:44.8994458Z x0 = x[:, :D] 2025-05-07T20:32:44.8994535Z x1 = x[:, D:] 2025-05-07T20:32:44.8994613Z 2025-05-07T20:32:44.8994696Z if contiguous: 2025-05-07T20:32:44.8994786Z x0 = x0.contiguous() 2025-05-07T20:32:44.8994877Z x1 = x1.contiguous() 2025-05-07T20:32:44.8994953Z 2025-05-07T20:32:44.8995043Z if scale_ub is not None: 2025-05-07T20:32:44.8995196Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8995332Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8995418Z ) 2025-05-07T20:32:44.8995494Z else: 2025-05-07T20:32:44.8995590Z scale_ub_tensor = None 2025-05-07T20:32:44.8995667Z 2025-05-07T20:32:44.8995796Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8995891Z op = silu_mul_quant 2025-05-07T20:32:44.8995982Z if compiled: 2025-05-07T20:32:44.8996088Z op = torch.compile(op) 2025-05-07T20:32:44.8996193Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8996269Z 2025-05-07T20:32:44.8996360Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.8996364Z 2025-05-07T20:32:44.8996459Z moe/activation_test.py:117: 2025-05-07T20:32:44.8996589Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8996689Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.8996796Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8997170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.8997264Z return fn(*args, **kwargs) 
Trying example: test_silu_mul_quant(
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[test body and traceback identical to the example above; fails in _fbgemm_silu_mul_quant with the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
Trying example: test_silu_mul_quant(
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)

T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    [test body as above; this run gets past fn() and instead fails in the reference path]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
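Note that this example fails in a different kernel, _kernel_quantize_fp8_row, reached through the test's ref_fn: the reference path computes SiLU(x0) * x1 in fp32 and then quantizes row-wise to FP8. For orientation, here is a rough eager-mode equivalent of that reference computation in plain PyTorch; it assumes torch.float8_e4m3fn is available and uses its maximum finite value of 448.0, and it is an illustrative sketch of the semantics, not FBGEMM's triton_quantize_fp8_row implementation (the clamping details may differ):

from typing import Optional, Tuple
import torch

FP8_MAX = 448.0  # max finite value of torch.float8_e4m3fn

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in fp32, as in the test's ref_fn.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    # One scale per row; scale_ub, when given, caps the per-row max.
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
    # Quantize so that y_fp8.to(torch.float32) * scale[:, None] recovers y.
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale

This matches the dequantization used by the test, y = y_fp8.to(torch.float32) * y_scale[:, None].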
Trying example: test_silu_mul_quant(
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[test body and traceback identical to the first example; fails in _fbgemm_silu_mul_quant with the same fp8e4nv CompilationError]
Trying example: test_silu_mul_quant(
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
[test body and traceback identical to the first example, minus the torch._dynamo frame since compiled=False; same fp8e4nv CompilationError in _fbgemm_silu_mul_quant]
Trying example: test_silu_mul_quant(
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[test body and traceback identical to the first example; same fp8e4nv CompilationError in _fbgemm_silu_mul_quant]
Trying example: test_silu_mul_quant(
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[test body and traceback identical to the first example; same fp8e4nv CompilationError in _fbgemm_silu_mul_quant]
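As the repeated examples show, Hypothesis keeps drawing fresh parameter combinations and every one fails the same way. When triaging a failure like this, it can help to pin one known-bad combination so it always runs first; a sketch using Hypothesis's @example decorator with one failing combination from this log (standalone function rather than the test class's method, for brevity):

from hypothesis import example, given, settings, strategies as st

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
# Pin the first failing combination seen in this log so it is always replayed.
@example(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
@settings(deadline=None)
def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled):
    ...  # same body as the test above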
Trying example: test_silu_mul_quant(
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[test body and traceback identical to the first example, minus the torch._dynamo frame since compiled=False; same fp8e4nv CompilationError in _fbgemm_silu_mul_quant]
Trying example: test_silu_mul_quant(
    T=4096,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
[test body and traceback identical to the first example, minus the torch._dynamo frame since compiled=False; same fp8e4nv CompilationError in _fbgemm_silu_mul_quant]
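The contiguous flag sampled by these examples matters because the test slices x0 and x1 out of the columns of a single [T, 2 * D] buffer, and column slices are non-contiguous views; the .contiguous() branch copies them into dense storage. A small self-contained illustration of that difference:

import torch

x = torch.randn(4, 8)        # stands in for the [T, 2 * D] activation buffer
x0, x1 = x[:, :4], x[:, 4:]  # column slices share x's storage
print(x0.is_contiguous(), x1.is_contiguous())  # False False: strides skip columns
print(x0.contiguous().is_contiguous())         # True: copied into dense storage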
Trying example: test_silu_mul_quant(
    T=16384,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
[test body and traceback identical to the first example; same fp8e4nv CompilationError in _fbgemm_silu_mul_quant]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9138619Z 2025-05-07T20:32:44.9139034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9139539Z 2025-05-07T20:32:44.9139642Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9140052Z self=, 2025-05-07T20:32:44.9140500Z T=4096, 2025-05-07T20:32:44.9140687Z D=5120, 2025-05-07T20:32:44.9140878Z scale_ub=None, 2025-05-07T20:32:44.9141090Z contiguous=False, 2025-05-07T20:32:44.9141310Z compiled=True, 2025-05-07T20:32:44.9141511Z ) 2025-05-07T20:32:44.9141825Z self = 2025-05-07T20:32:44.9142318Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.9142588Z 2025-05-07T20:32:44.9142668Z @given( 2025-05-07T20:32:44.9142898Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9143210Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9143512Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9143838Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9144166Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9144442Z ) 2025-05-07T20:32:44.9144789Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9145231Z def test_silu_mul_quant( 2025-05-07T20:32:44.9145470Z self, 2025-05-07T20:32:44.9145658Z T: int, 2025-05-07T20:32:44.9145851Z D: int, 2025-05-07T20:32:44.9146069Z scale_ub: Optional[float], 2025-05-07T20:32:44.9146332Z contiguous: bool, 2025-05-07T20:32:44.9146570Z compiled: bool, 2025-05-07T20:32:44.9146791Z ) -> None: 2025-05-07T20:32:44.9147046Z torch.manual_seed(2025) 2025-05-07T20:32:44.9147292Z 2025-05-07T20:32:44.9147565Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9147902Z 2025-05-07T20:32:44.9148123Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9148428Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9148729Z x = x_sign * x_clamp 2025-05-07T20:32:44.9148968Z x0 = x[:, :D] 2025-05-07T20:32:44.9149187Z x1 = x[:, D:] 2025-05-07T20:32:44.9149393Z 2025-05-07T20:32:44.9149577Z if contiguous: 2025-05-07T20:32:44.9149850Z x0 = x0.contiguous() 2025-05-07T20:32:44.9150101Z x1 = x1.contiguous() 2025-05-07T20:32:44.9150338Z 2025-05-07T20:32:44.9150529Z if scale_ub is not None: 2025-05-07T20:32:44.9150798Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9151127Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9151434Z ) 2025-05-07T20:32:44.9151630Z else: 2025-05-07T20:32:44.9151879Z scale_ub_tensor = None 2025-05-07T20:32:44.9152126Z 2025-05-07T20:32:44.9152354Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9152661Z op = silu_mul_quant 2025-05-07T20:32:44.9152909Z if compiled: 2025-05-07T20:32:44.9153154Z op = torch.compile(op) 2025-05-07T20:32:44.9153442Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9153713Z 2025-05-07T20:32:44.9153901Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9154067Z 2025-05-07T20:32:44.9154163Z moe/activation_test.py:117: 2025-05-07T20:32:44.9154455Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9154783Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9155058Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9155618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9156178Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9139642Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.9140052Z     self=<...>,
2025-05-07T20:32:44.9140500Z     T=4096,
2025-05-07T20:32:44.9140687Z     D=5120,
2025-05-07T20:32:44.9140878Z     scale_ub=None,
2025-05-07T20:32:44.9141090Z     contiguous=False,
2025-05-07T20:32:44.9141310Z     compiled=True,
2025-05-07T20:32:44.9141511Z )
2025-05-07T20:32:44.9141825Z self = <...>
2025-05-07T20:32:44.9142318Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:44.9142588Z 
2025-05-07T20:32:44.9142668Z     @given(
2025-05-07T20:32:44.9142898Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:44.9143210Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:44.9143512Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:44.9143838Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:44.9144166Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:44.9144442Z     )
2025-05-07T20:32:44.9144789Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:44.9145231Z     def test_silu_mul_quant(
2025-05-07T20:32:44.9145470Z         self,
2025-05-07T20:32:44.9145658Z         T: int,
2025-05-07T20:32:44.9145851Z         D: int,
2025-05-07T20:32:44.9146069Z         scale_ub: Optional[float],
2025-05-07T20:32:44.9146332Z         contiguous: bool,
2025-05-07T20:32:44.9146570Z         compiled: bool,
2025-05-07T20:32:44.9146791Z     ) -> None:
2025-05-07T20:32:44.9147046Z         torch.manual_seed(2025)
2025-05-07T20:32:44.9147292Z 
2025-05-07T20:32:44.9147565Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:44.9147902Z 
2025-05-07T20:32:44.9148123Z         x_sign = torch.sign(x)
2025-05-07T20:32:44.9148428Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:44.9148729Z         x = x_sign * x_clamp
2025-05-07T20:32:44.9148968Z         x0 = x[:, :D]
2025-05-07T20:32:44.9149187Z         x1 = x[:, D:]
2025-05-07T20:32:44.9149393Z 
2025-05-07T20:32:44.9149577Z         if contiguous:
2025-05-07T20:32:44.9149850Z             x0 = x0.contiguous()
2025-05-07T20:32:44.9150101Z             x1 = x1.contiguous()
2025-05-07T20:32:44.9150338Z 
2025-05-07T20:32:44.9150529Z         if scale_ub is not None:
2025-05-07T20:32:44.9150798Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:44.9151127Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:44.9151434Z             )
2025-05-07T20:32:44.9151630Z         else:
2025-05-07T20:32:44.9151879Z             scale_ub_tensor = None
2025-05-07T20:32:44.9152126Z 
2025-05-07T20:32:44.9152354Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:44.9152661Z             op = silu_mul_quant
2025-05-07T20:32:44.9152909Z             if compiled:
2025-05-07T20:32:44.9153154Z                 op = torch.compile(op)
2025-05-07T20:32:44.9153442Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:44.9153713Z 
2025-05-07T20:32:44.9153901Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:44.9154067Z 
2025-05-07T20:32:44.9154163Z moe/activation_test.py:117: 
2025-05-07T20:32:44.9154455Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:44.9154783Z moe/activation_test.py:115: in fn
2025-05-07T20:32:44.9155058Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:44.9155618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:44.9156178Z     return fn(*args, **kwargs)
2025-05-07T20:32:44.9156838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:44.9157522Z     _fbgemm_silu_mul_quant[grid](
[... same Triton frames as above ...]
2025-05-07T20:32:44.9168813Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.9169164Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:44.9169426Z E       ^
2025-05-07T20:32:44.9169885Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.9170383Z 
2025-05-07T20:32:44.9170799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
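For reference, the op under test fuses SiLU, an elementwise multiply, and rowwise FP8 quantization. A minimal eager sketch of the semantics the test exercises; the exact scaling rule (per-row absmax, `scale_ub` capping, clamping) is an assumption for illustration, not FBGEMM's actual kernel:

```python
from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Fused-op semantics: y = silu(x0) * x1, computed in fp32 for accuracy.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # Rowwise scale from the per-row absmax, optionally capped by scale_ub.
    row_max = y.abs().amax(dim=-1, keepdim=True)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = (row_max / FP8_MAX).clamp(min=1e-12)
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale
```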
[... the identical test body, traceback, and CompilationError repeat for each of the following examples; only the drawn parameters differ ...]
    Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
    Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
    Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
    Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
    Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
    Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
    Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
    Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
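Any single draw above reproduces the failure without Hypothesis. A standalone sketch, assuming `silu_mul_quant` is importable from the module whose file path appears in the traceback (the import path is inferred, not confirmed by the log); the final draw of this run follows below:

```python
import torch

# Import path inferred from .../fbgemm_gpu/experimental/gen_ai/moe/activation.py above.
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 2048, 7168  # one failing draw from the list above
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x0, x1 = x[:, :D], x[:, D:]

# Raises triton.compiler.errors.CompilationError on pre-sm_89 GPUs such as
# this runner's A10G; expected to succeed on fp8e4nv-capable hardware.
y_fp8, y_scale = silu_mul_quant(x0, x1, None)
```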
2025-05-07T20:32:44.9393935Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.9394157Z     self=<...>,
2025-05-07T20:32:44.9394241Z     T=16384,
2025-05-07T20:32:44.9394323Z     D=5120,
2025-05-07T20:32:44.9394407Z     scale_ub=None,
2025-05-07T20:32:44.9394500Z     contiguous=False,
2025-05-07T20:32:44.9394585Z     compiled=True,
2025-05-07T20:32:44.9394660Z )
[... same test body, traceback, and Triton frames as above ...]
2025-05-07T20:32:44.9406292Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.9406395Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:44.9406474Z E       ^
2025-05-07T20:32:44.9406828Z E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9406838Z 2025-05-07T20:32:44.9407248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9407253Z 2025-05-07T20:32:44.9407355Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9407625Z self=, 2025-05-07T20:32:44.9407705Z T=2048, 2025-05-07T20:32:44.9407860Z D=5120, 2025-05-07T20:32:44.9407951Z scale_ub=None, 2025-05-07T20:32:44.9408038Z contiguous=False, 2025-05-07T20:32:44.9408125Z compiled=True, 2025-05-07T20:32:44.9408199Z ) 2025-05-07T20:32:44.9408444Z self = 2025-05-07T20:32:44.9408635Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.9408640Z 2025-05-07T20:32:44.9408717Z @given( 2025-05-07T20:32:44.9408838Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9409000Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9409112Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9409227Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9409341Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9409415Z ) 2025-05-07T20:32:44.9409658Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9409810Z def test_silu_mul_quant( 2025-05-07T20:32:44.9409889Z self, 2025-05-07T20:32:44.9409966Z T: int, 2025-05-07T20:32:44.9410043Z D: int, 2025-05-07T20:32:44.9410142Z scale_ub: Optional[float], 2025-05-07T20:32:44.9410230Z contiguous: bool, 2025-05-07T20:32:44.9410313Z compiled: bool, 2025-05-07T20:32:44.9410391Z ) -> None: 2025-05-07T20:32:44.9410487Z torch.manual_seed(2025) 2025-05-07T20:32:44.9410562Z 2025-05-07T20:32:44.9410730Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9410807Z 2025-05-07T20:32:44.9410897Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9411025Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9411113Z x = x_sign * x_clamp 2025-05-07T20:32:44.9411192Z x0 = x[:, :D] 2025-05-07T20:32:44.9411273Z x1 = x[:, D:] 2025-05-07T20:32:44.9411345Z 2025-05-07T20:32:44.9411428Z if contiguous: 2025-05-07T20:32:44.9411526Z x0 = x0.contiguous() 2025-05-07T20:32:44.9411614Z x1 = x1.contiguous() 2025-05-07T20:32:44.9411686Z 2025-05-07T20:32:44.9411779Z if scale_ub is not None: 2025-05-07T20:32:44.9411883Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9412016Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9412096Z ) 2025-05-07T20:32:44.9412172Z else: 2025-05-07T20:32:44.9412330Z scale_ub_tensor = None 2025-05-07T20:32:44.9412411Z 2025-05-07T20:32:44.9412543Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9412633Z op = silu_mul_quant 2025-05-07T20:32:44.9412719Z if compiled: 2025-05-07T20:32:44.9412819Z op = torch.compile(op) 2025-05-07T20:32:44.9412924Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9412996Z 2025-05-07T20:32:44.9413085Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9413092Z 2025-05-07T20:32:44.9413193Z moe/activation_test.py:117: 2025-05-07T20:32:44.9413322Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9413421Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9413523Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9413888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9413984Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9414477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9414576Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9414936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9415153Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9415539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9415634Z kernel = self.compile( 2025-05-07T20:32:44.9416010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9416186Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9416310Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9416318Z 2025-05-07T20:32:44.9416584Z self = 2025-05-07T20:32:44.9417363Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9417905Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e4f7240>} 2025-05-07T20:32:44.9418699Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9418886Z context = 2025-05-07T20:32:44.9418891Z 2025-05-07T20:32:44.9419059Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9419322Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9419428Z module_map=module_map) 2025-05-07T20:32:44.9419595Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9419694Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9419771Z E ^ 2025-05-07T20:32:44.9420130Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9420135Z 2025-05-07T20:32:44.9420547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9420552Z 2025-05-07T20:32:44.9420659Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9420876Z self=, 2025-05-07T20:32:44.9420999Z T=2048, 2025-05-07T20:32:44.9421081Z D=5120, 2025-05-07T20:32:44.9421168Z scale_ub=1200.0, 2025-05-07T20:32:44.9421254Z contiguous=False, 2025-05-07T20:32:44.9421337Z compiled=True, 2025-05-07T20:32:44.9421411Z ) 2025-05-07T20:32:44.9421631Z self = 2025-05-07T20:32:44.9421803Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.9421808Z 2025-05-07T20:32:44.9421885Z @given( 2025-05-07T20:32:44.9422011Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9422108Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9422220Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9422336Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9422448Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9422522Z ) 2025-05-07T20:32:44.9422770Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9422867Z def test_silu_mul_quant( 2025-05-07T20:32:44.9422944Z self, 2025-05-07T20:32:44.9423021Z T: int, 2025-05-07T20:32:44.9423096Z D: int, 2025-05-07T20:32:44.9423196Z scale_ub: Optional[float], 2025-05-07T20:32:44.9423283Z contiguous: bool, 2025-05-07T20:32:44.9423367Z compiled: bool, 2025-05-07T20:32:44.9423450Z ) -> None: 2025-05-07T20:32:44.9423544Z torch.manual_seed(2025) 2025-05-07T20:32:44.9423661Z 2025-05-07T20:32:44.9423836Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9423911Z 2025-05-07T20:32:44.9423998Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9424125Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9424213Z x = x_sign * x_clamp 2025-05-07T20:32:44.9424295Z x0 = x[:, :D] 2025-05-07T20:32:44.9424373Z x1 = x[:, D:] 2025-05-07T20:32:44.9424448Z 2025-05-07T20:32:44.9424536Z if contiguous: 2025-05-07T20:32:44.9424665Z x0 = x0.contiguous() 2025-05-07T20:32:44.9424754Z x1 = x1.contiguous() 2025-05-07T20:32:44.9424829Z 2025-05-07T20:32:44.9424918Z if scale_ub is not None: 2025-05-07T20:32:44.9425021Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9425157Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9425231Z ) 2025-05-07T20:32:44.9425305Z else: 2025-05-07T20:32:44.9425449Z scale_ub_tensor = None 2025-05-07T20:32:44.9425525Z 2025-05-07T20:32:44.9425655Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9425752Z op = silu_mul_quant 2025-05-07T20:32:44.9425836Z if compiled: 2025-05-07T20:32:44.9425937Z op = torch.compile(op) 2025-05-07T20:32:44.9426041Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9426113Z 2025-05-07T20:32:44.9426211Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9426216Z 2025-05-07T20:32:44.9426315Z moe/activation_test.py:117: 2025-05-07T20:32:44.9426446Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9426547Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9426643Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9427010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9427107Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9427604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9427703Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9428055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9428276Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9428658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9428757Z kernel = self.compile( 2025-05-07T20:32:44.9429138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9429309Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9429441Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9429448Z 2025-05-07T20:32:44.9429649Z self = 2025-05-07T20:32:44.9430415Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9430916Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e50c720>} 2025-05-07T20:32:44.9431664Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9431856Z context = 2025-05-07T20:32:44.9431861Z 2025-05-07T20:32:44.9432066Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9432329Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9432434Z module_map=module_map) 2025-05-07T20:32:44.9432595Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9432695Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9432776Z E ^ 2025-05-07T20:32:44.9433127Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9433172Z 2025-05-07T20:32:44.9433586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9433591Z 2025-05-07T20:32:44.9433693Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9433915Z self=, 2025-05-07T20:32:44.9434032Z T=4096, 2025-05-07T20:32:44.9434109Z D=5120, 2025-05-07T20:32:44.9434197Z scale_ub=1200.0, 2025-05-07T20:32:44.9434281Z contiguous=True, 2025-05-07T20:32:44.9434362Z compiled=True, 2025-05-07T20:32:44.9434441Z ) 2025-05-07T20:32:44.9434655Z self = 2025-05-07T20:32:44.9434823Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9434834Z 2025-05-07T20:32:44.9434907Z @given( 2025-05-07T20:32:44.9435028Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9435127Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9435239Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9435354Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9435469Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9435545Z ) 2025-05-07T20:32:44.9435791Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9435888Z def test_silu_mul_quant( 2025-05-07T20:32:44.9435964Z self, 2025-05-07T20:32:44.9436038Z T: int, 2025-05-07T20:32:44.9436118Z D: int, 2025-05-07T20:32:44.9436214Z scale_ub: Optional[float], 2025-05-07T20:32:44.9436304Z contiguous: bool, 2025-05-07T20:32:44.9436388Z compiled: bool, 2025-05-07T20:32:44.9436464Z ) -> None: 2025-05-07T20:32:44.9436610Z torch.manual_seed(2025) 2025-05-07T20:32:44.9436685Z 2025-05-07T20:32:44.9436852Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9436929Z 2025-05-07T20:32:44.9437019Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9437140Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9437232Z x = x_sign * x_clamp 2025-05-07T20:32:44.9437311Z x0 = x[:, :D] 2025-05-07T20:32:44.9437391Z x1 = x[:, D:] 2025-05-07T20:32:44.9437466Z 2025-05-07T20:32:44.9437552Z if contiguous: 2025-05-07T20:32:44.9437641Z x0 = x0.contiguous() 2025-05-07T20:32:44.9437734Z x1 = x1.contiguous() 2025-05-07T20:32:44.9437805Z 2025-05-07T20:32:44.9437895Z if scale_ub is not None: 2025-05-07T20:32:44.9438001Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9438143Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9438241Z ) 2025-05-07T20:32:44.9438324Z else: 2025-05-07T20:32:44.9438437Z scale_ub_tensor = None 2025-05-07T20:32:44.9438513Z 2025-05-07T20:32:44.9438640Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9438727Z op = silu_mul_quant 2025-05-07T20:32:44.9438812Z if compiled: 2025-05-07T20:32:44.9438913Z op = torch.compile(op) 2025-05-07T20:32:44.9439018Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9439093Z 2025-05-07T20:32:44.9439227Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9439232Z 2025-05-07T20:32:44.9439336Z moe/activation_test.py:117: 2025-05-07T20:32:44.9439465Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9439565Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9439667Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9440030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9440124Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9440658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9440753Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9441108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9441371Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9441708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9441803Z kernel = self.compile( 2025-05-07T20:32:44.9442180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9442356Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9442485Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9442492Z 2025-05-07T20:32:44.9442693Z self = 2025-05-07T20:32:44.9443463Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9443967Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e50d260>} 2025-05-07T20:32:44.9444710Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9444898Z context = 2025-05-07T20:32:44.9444944Z 2025-05-07T20:32:44.9445109Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9445371Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9445478Z module_map=module_map) 2025-05-07T20:32:44.9445639Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9445737Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9445816Z E ^ 2025-05-07T20:32:44.9446170Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9446175Z 2025-05-07T20:32:44.9446585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9446589Z 2025-05-07T20:32:44.9446693Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9446910Z self=, 2025-05-07T20:32:44.9446994Z T=128, 2025-05-07T20:32:44.9447069Z D=5120, 2025-05-07T20:32:44.9447151Z scale_ub=1200.0, 2025-05-07T20:32:44.9447236Z contiguous=False, 2025-05-07T20:32:44.9447320Z compiled=True, 2025-05-07T20:32:44.9447393Z ) 2025-05-07T20:32:44.9447662Z self = 2025-05-07T20:32:44.9447836Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.9447912Z 2025-05-07T20:32:44.9447994Z @given( 2025-05-07T20:32:44.9448114Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9448211Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9448346Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9448476Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9448599Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9448678Z ) 2025-05-07T20:32:44.9448922Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9449070Z def test_silu_mul_quant( 2025-05-07T20:32:44.9449145Z self, 2025-05-07T20:32:44.9449223Z T: int, 2025-05-07T20:32:44.9449300Z D: int, 2025-05-07T20:32:44.9449397Z scale_ub: Optional[float], 2025-05-07T20:32:44.9449488Z contiguous: bool, 2025-05-07T20:32:44.9449574Z compiled: bool, 2025-05-07T20:32:44.9449654Z ) -> None: 2025-05-07T20:32:44.9449793Z torch.manual_seed(2025) 2025-05-07T20:32:44.9449869Z 2025-05-07T20:32:44.9450039Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9450113Z 2025-05-07T20:32:44.9450201Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9450328Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9450415Z x = x_sign * x_clamp 2025-05-07T20:32:44.9450494Z x0 = x[:, :D] 2025-05-07T20:32:44.9450579Z x1 = x[:, D:] 2025-05-07T20:32:44.9450653Z 2025-05-07T20:32:44.9450735Z if contiguous: 2025-05-07T20:32:44.9450828Z x0 = x0.contiguous() 2025-05-07T20:32:44.9450918Z x1 = x1.contiguous() 2025-05-07T20:32:44.9450990Z 2025-05-07T20:32:44.9451083Z if scale_ub is not None: 2025-05-07T20:32:44.9451186Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9451322Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9451399Z ) 2025-05-07T20:32:44.9451478Z else: 2025-05-07T20:32:44.9451573Z scale_ub_tensor = None 2025-05-07T20:32:44.9451646Z 2025-05-07T20:32:44.9451775Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9451870Z op = silu_mul_quant 2025-05-07T20:32:44.9451953Z if compiled: 2025-05-07T20:32:44.9452051Z op = torch.compile(op) 2025-05-07T20:32:44.9452156Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9452273Z 2025-05-07T20:32:44.9452366Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9452374Z 2025-05-07T20:32:44.9452469Z moe/activation_test.py:117: 2025-05-07T20:32:44.9452596Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9452697Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9452794Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9453161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9453256Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9453744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9453840Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9454196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9454418Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9454758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9454850Z kernel = self.compile( 2025-05-07T20:32:44.9455225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9455399Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9455571Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9455576Z 2025-05-07T20:32:44.9455782Z self = 2025-05-07T20:32:44.9456548Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9457049Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e50e480>} 2025-05-07T20:32:44.9457830Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9458020Z context = 2025-05-07T20:32:44.9458062Z 2025-05-07T20:32:44.9458230Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9458490Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9458594Z module_map=module_map) 2025-05-07T20:32:44.9458756Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9458855Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9458936Z E ^ 2025-05-07T20:32:44.9459288Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9459293Z 2025-05-07T20:32:44.9459701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9459705Z 2025-05-07T20:32:44.9459810Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9460032Z self=, 2025-05-07T20:32:44.9460110Z T=16384, 2025-05-07T20:32:44.9460188Z D=7168, 2025-05-07T20:32:44.9460269Z scale_ub=1200.0, 2025-05-07T20:32:44.9460355Z contiguous=True, 2025-05-07T20:32:44.9460437Z compiled=True, 2025-05-07T20:32:44.9460509Z ) 2025-05-07T20:32:44.9460729Z self = 2025-05-07T20:32:44.9460905Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9460956Z 2025-05-07T20:32:44.9461034Z @given( 2025-05-07T20:32:44.9461154Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9461252Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9461366Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9461481Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9461591Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9461667Z ) 2025-05-07T20:32:44.9461912Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9462005Z def test_silu_mul_quant( 2025-05-07T20:32:44.9462087Z self, 2025-05-07T20:32:44.9462164Z T: int, 2025-05-07T20:32:44.9462238Z D: int, 2025-05-07T20:32:44.9462337Z scale_ub: Optional[float], 2025-05-07T20:32:44.9462426Z contiguous: bool, 2025-05-07T20:32:44.9462511Z compiled: bool, 2025-05-07T20:32:44.9462594Z ) -> None: 2025-05-07T20:32:44.9462689Z torch.manual_seed(2025) 2025-05-07T20:32:44.9462764Z 2025-05-07T20:32:44.9462930Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9463005Z 2025-05-07T20:32:44.9463099Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9463221Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9463309Z x = x_sign * x_clamp 2025-05-07T20:32:44.9463393Z x0 = x[:, :D] 2025-05-07T20:32:44.9463516Z x1 = x[:, D:] 2025-05-07T20:32:44.9463591Z 2025-05-07T20:32:44.9463677Z if contiguous: 2025-05-07T20:32:44.9463769Z x0 = x0.contiguous() 2025-05-07T20:32:44.9463856Z x1 = x1.contiguous() 2025-05-07T20:32:44.9463931Z 2025-05-07T20:32:44.9464022Z if scale_ub is not None: 2025-05-07T20:32:44.9464125Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9464261Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9464338Z ) 2025-05-07T20:32:44.9464458Z else: 2025-05-07T20:32:44.9464552Z scale_ub_tensor = None 2025-05-07T20:32:44.9464626Z 2025-05-07T20:32:44.9464761Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9464850Z op = silu_mul_quant 2025-05-07T20:32:44.9464934Z if compiled: 2025-05-07T20:32:44.9465034Z op = torch.compile(op) 2025-05-07T20:32:44.9465144Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9465256Z 2025-05-07T20:32:44.9465348Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9465352Z 2025-05-07T20:32:44.9465446Z moe/activation_test.py:117: 2025-05-07T20:32:44.9465576Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9465675Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9465773Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9466138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9466236Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9466724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9466823Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9467175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9467399Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9467736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9467829Z kernel = self.compile( 2025-05-07T20:32:44.9468231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9468503Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9468632Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9468641Z 2025-05-07T20:32:44.9468842Z self = 2025-05-07T20:32:44.9469614Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9470113Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e50fd80>} 2025-05-07T20:32:44.9470852Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9471046Z context = 2025-05-07T20:32:44.9471053Z 2025-05-07T20:32:44.9471214Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9471473Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9471579Z module_map=module_map) 2025-05-07T20:32:44.9471737Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9471876Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9471958Z E ^ 2025-05-07T20:32:44.9472309Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9472314Z 2025-05-07T20:32:44.9472727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9472732Z 2025-05-07T20:32:44.9472835Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9473055Z self=, 2025-05-07T20:32:44.9473174Z T=16384, 2025-05-07T20:32:44.9473250Z D=5120, 2025-05-07T20:32:44.9473336Z scale_ub=1200.0, 2025-05-07T20:32:44.9473421Z contiguous=True, 2025-05-07T20:32:44.9473503Z compiled=False, 2025-05-07T20:32:44.9473580Z ) 2025-05-07T20:32:44.9473795Z self = 2025-05-07T20:32:44.9474011Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9474016Z 2025-05-07T20:32:44.9474092Z @given( 2025-05-07T20:32:44.9474209Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9474304Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9474422Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9474538Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9474654Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9474731Z ) 2025-05-07T20:32:44.9474971Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9475069Z def test_silu_mul_quant( 2025-05-07T20:32:44.9475144Z self, 2025-05-07T20:32:44.9475221Z T: int, 2025-05-07T20:32:44.9475297Z D: int, 2025-05-07T20:32:44.9475394Z scale_ub: Optional[float], 2025-05-07T20:32:44.9475482Z contiguous: bool, 2025-05-07T20:32:44.9475569Z compiled: bool, 2025-05-07T20:32:44.9475650Z ) -> None: 2025-05-07T20:32:44.9475741Z torch.manual_seed(2025) 2025-05-07T20:32:44.9475816Z 2025-05-07T20:32:44.9475981Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9476057Z 2025-05-07T20:32:44.9476148Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9476270Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9476360Z x = x_sign * x_clamp 2025-05-07T20:32:44.9476509Z x0 = x[:, :D] 2025-05-07T20:32:44.9476590Z x1 = x[:, D:] 2025-05-07T20:32:44.9476664Z 2025-05-07T20:32:44.9476748Z if contiguous: 2025-05-07T20:32:44.9476841Z x0 = x0.contiguous() 2025-05-07T20:32:44.9476930Z x1 = x1.contiguous() 2025-05-07T20:32:44.9477002Z 2025-05-07T20:32:44.9477095Z if scale_ub is not None: 2025-05-07T20:32:44.9477203Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9477343Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9477419Z ) 2025-05-07T20:32:44.9477495Z else: 2025-05-07T20:32:44.9477587Z scale_ub_tensor = None 2025-05-07T20:32:44.9477663Z 2025-05-07T20:32:44.9477791Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9477883Z op = silu_mul_quant 2025-05-07T20:32:44.9477966Z if compiled: 2025-05-07T20:32:44.9478069Z op = torch.compile(op) 2025-05-07T20:32:44.9478181Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9478259Z 2025-05-07T20:32:44.9478370Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9478374Z 2025-05-07T20:32:44.9478483Z moe/activation_test.py:117: 2025-05-07T20:32:44.9478624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9478722Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9478822Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9479366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:44.9479465Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9479819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9480036Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9480376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9480509Z kernel = self.compile( 2025-05-07T20:32:44.9480891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9481062Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9481186Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9481193Z 2025-05-07T20:32:44.9481438Z self = 2025-05-07T20:32:44.9482206Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9482705Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e3f8cc0>} 2025-05-07T20:32:44.9483451Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9483640Z context = 2025-05-07T20:32:44.9483645Z 2025-05-07T20:32:44.9483812Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9484073Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9484182Z module_map=module_map) 2025-05-07T20:32:44.9484341Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9484438Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9484515Z E ^ 2025-05-07T20:32:44.9484862Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9484914Z 2025-05-07T20:32:44.9485324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9485331Z 2025-05-07T20:32:44.9485433Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9485653Z self=, 2025-05-07T20:32:44.9485735Z T=1, 2025-05-07T20:32:44.9485816Z D=7168, 2025-05-07T20:32:44.9485901Z scale_ub=1200.0, 2025-05-07T20:32:44.9485988Z contiguous=False, 2025-05-07T20:32:44.9486071Z compiled=False, 2025-05-07T20:32:44.9486145Z ) 2025-05-07T20:32:44.9486363Z self = 2025-05-07T20:32:44.9486527Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.9486531Z 2025-05-07T20:32:44.9486608Z @given( 2025-05-07T20:32:44.9486730Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9486829Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9486947Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9487060Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9487170Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9487251Z ) 2025-05-07T20:32:44.9487579Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9487678Z def test_silu_mul_quant( 2025-05-07T20:32:44.9487754Z self, 2025-05-07T20:32:44.9487832Z T: int, 2025-05-07T20:32:44.9487908Z D: int, 2025-05-07T20:32:44.9488007Z scale_ub: Optional[float], 2025-05-07T20:32:44.9488096Z contiguous: bool, 2025-05-07T20:32:44.9488181Z compiled: bool, 2025-05-07T20:32:44.9488258Z ) -> None: 2025-05-07T20:32:44.9488350Z torch.manual_seed(2025) 2025-05-07T20:32:44.9488429Z 2025-05-07T20:32:44.9488597Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9488718Z 2025-05-07T20:32:44.9488811Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9488937Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9489024Z x = x_sign * x_clamp 2025-05-07T20:32:44.9489107Z x0 = x[:, :D] 2025-05-07T20:32:44.9489192Z x1 = x[:, D:] 2025-05-07T20:32:44.9489264Z 2025-05-07T20:32:44.9489352Z if contiguous: 2025-05-07T20:32:44.9489480Z x0 = x0.contiguous() 2025-05-07T20:32:44.9489573Z x1 = x1.contiguous() 2025-05-07T20:32:44.9489645Z 2025-05-07T20:32:44.9489735Z if scale_ub is not None: 2025-05-07T20:32:44.9489841Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9489973Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9490048Z ) 2025-05-07T20:32:44.9490125Z else: 2025-05-07T20:32:44.9490224Z scale_ub_tensor = None 2025-05-07T20:32:44.9490300Z 2025-05-07T20:32:44.9490434Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9490524Z op = silu_mul_quant 2025-05-07T20:32:44.9490607Z if compiled: 2025-05-07T20:32:44.9490708Z op = torch.compile(op) 2025-05-07T20:32:44.9490812Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9490887Z 2025-05-07T20:32:44.9490976Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9490983Z 2025-05-07T20:32:44.9491083Z moe/activation_test.py:117: 2025-05-07T20:32:44.9491213Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9491315Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9491412Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9491909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9492051Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9492411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9492630Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9492966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9493066Z kernel = self.compile( 2025-05-07T20:32:44.9493447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9493619Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9493748Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9493752Z 2025-05-07T20:32:44.9493954Z self = 2025-05-07T20:32:44.9494727Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9495226Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e3f9080>} 2025-05-07T20:32:44.9496011Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9496202Z context = 2025-05-07T20:32:44.9496206Z 2025-05-07T20:32:44.9496369Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9496631Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9496742Z module_map=module_map) 2025-05-07T20:32:44.9496943Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9497044Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9497122Z E ^ 2025-05-07T20:32:44.9497475Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9497479Z 2025-05-07T20:32:44.9497928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9497933Z 2025-05-07T20:32:44.9498037Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9498259Z self=, 2025-05-07T20:32:44.9498336Z T=4096, 2025-05-07T20:32:44.9498418Z D=7168, 2025-05-07T20:32:44.9498501Z scale_ub=1200.0, 2025-05-07T20:32:44.9498586Z contiguous=False, 2025-05-07T20:32:44.9498677Z compiled=True, 2025-05-07T20:32:44.9498751Z ) 2025-05-07T20:32:44.9498970Z self = 2025-05-07T20:32:44.9499144Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.9499148Z 2025-05-07T20:32:44.9499224Z @given( 2025-05-07T20:32:44.9499341Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9499442Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9499557Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9499674Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9499785Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9499858Z ) 2025-05-07T20:32:44.9500100Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9500194Z def test_silu_mul_quant( 2025-05-07T20:32:44.9500270Z self, 2025-05-07T20:32:44.9500392Z T: int, 2025-05-07T20:32:44.9500469Z D: int, 2025-05-07T20:32:44.9500568Z scale_ub: Optional[float], 2025-05-07T20:32:44.9500659Z contiguous: bool, 2025-05-07T20:32:44.9500743Z compiled: bool, 2025-05-07T20:32:44.9500820Z ) -> None: 2025-05-07T20:32:44.9500915Z torch.manual_seed(2025) 2025-05-07T20:32:44.9500987Z 2025-05-07T20:32:44.9501154Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9501229Z 2025-05-07T20:32:44.9501321Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9501448Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9501533Z x = x_sign * x_clamp 2025-05-07T20:32:44.9501613Z x0 = x[:, :D] 2025-05-07T20:32:44.9501696Z x1 = x[:, D:] 2025-05-07T20:32:44.9501769Z 2025-05-07T20:32:44.9501850Z if contiguous: 2025-05-07T20:32:44.9501943Z x0 = x0.contiguous() 2025-05-07T20:32:44.9502031Z x1 = x1.contiguous() 2025-05-07T20:32:44.9502106Z 2025-05-07T20:32:44.9502197Z if scale_ub is not None: 2025-05-07T20:32:44.9502305Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9502437Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9502514Z ) 2025-05-07T20:32:44.9502590Z else: 2025-05-07T20:32:44.9502685Z scale_ub_tensor = None 2025-05-07T20:32:44.9502755Z 2025-05-07T20:32:44.9502881Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9503016Z op = silu_mul_quant 2025-05-07T20:32:44.9503101Z if compiled: 2025-05-07T20:32:44.9503199Z op = torch.compile(op) 2025-05-07T20:32:44.9503306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9503378Z 2025-05-07T20:32:44.9508780Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9508792Z 2025-05-07T20:32:44.9508908Z moe/activation_test.py:117: 2025-05-07T20:32:44.9509040Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9509246Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9509348Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9509721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9509821Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9510316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9510487Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9510843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9511065Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9511404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9511505Z kernel = self.compile( 2025-05-07T20:32:44.9511886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9512064Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9512190Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9512195Z 2025-05-07T20:32:44.9512402Z self = 2025-05-07T20:32:44.9513180Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9513682Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e3fb060>} 2025-05-07T20:32:44.9514496Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9514687Z context = 2025-05-07T20:32:44.9514692Z 2025-05-07T20:32:44.9514862Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9515124Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9515238Z module_map=module_map) 2025-05-07T20:32:44.9515397Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9515493Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9515573Z E ^ 2025-05-07T20:32:44.9515924Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9515932Z 2025-05-07T20:32:44.9516343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9516350Z 2025-05-07T20:32:44.9516456Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9516674Z self=, 2025-05-07T20:32:44.9516756Z T=128, 2025-05-07T20:32:44.9516834Z D=7168, 2025-05-07T20:32:44.9516918Z scale_ub=1200.0, 2025-05-07T20:32:44.9517005Z contiguous=False, 2025-05-07T20:32:44.9517149Z compiled=True, 2025-05-07T20:32:44.9517224Z ) 2025-05-07T20:32:44.9517442Z self = 2025-05-07T20:32:44.9517611Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.9517616Z 2025-05-07T20:32:44.9517692Z @given( 2025-05-07T20:32:44.9517813Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9517912Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9518032Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9518191Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9518302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9518377Z ) 2025-05-07T20:32:44.9518619Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9518711Z def test_silu_mul_quant( 2025-05-07T20:32:44.9518791Z self, 2025-05-07T20:32:44.9518873Z T: int, 2025-05-07T20:32:44.9519011Z D: int, 2025-05-07T20:32:44.9519115Z scale_ub: Optional[float], 2025-05-07T20:32:44.9519204Z contiguous: bool, 2025-05-07T20:32:44.9519288Z compiled: bool, 2025-05-07T20:32:44.9519369Z ) -> None: 2025-05-07T20:32:44.9519461Z torch.manual_seed(2025) 2025-05-07T20:32:44.9519537Z 2025-05-07T20:32:44.9519705Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9519784Z 2025-05-07T20:32:44.9519877Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9520002Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9520096Z x = x_sign * x_clamp 2025-05-07T20:32:44.9520177Z x0 = x[:, :D] 2025-05-07T20:32:44.9520256Z x1 = x[:, D:] 2025-05-07T20:32:44.9520328Z 2025-05-07T20:32:44.9520410Z if contiguous: 2025-05-07T20:32:44.9520500Z x0 = x0.contiguous() 2025-05-07T20:32:44.9520593Z x1 = x1.contiguous() 2025-05-07T20:32:44.9520666Z 2025-05-07T20:32:44.9520756Z if scale_ub is not None: 2025-05-07T20:32:44.9520865Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9520999Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9521075Z ) 2025-05-07T20:32:44.9521152Z else: 2025-05-07T20:32:44.9521247Z scale_ub_tensor = None 2025-05-07T20:32:44.9521323Z 2025-05-07T20:32:44.9521451Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9521591Z op = silu_mul_quant 2025-05-07T20:32:44.9521678Z if compiled: 2025-05-07T20:32:44.9521777Z op = torch.compile(op) 2025-05-07T20:32:44.9521880Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9521956Z 2025-05-07T20:32:44.9522045Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9522050Z 2025-05-07T20:32:44.9522149Z moe/activation_test.py:117: 2025-05-07T20:32:44.9522283Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9522384Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9522485Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9522848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9522938Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9523429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9523532Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9523889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9524108Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9524445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9524584Z kernel = self.compile( 2025-05-07T20:32:44.9524963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9525135Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9525266Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9525270Z 2025-05-07T20:32:44.9525474Z self = 2025-05-07T20:32:44.9526290Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9526789Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e0c0360>} 2025-05-07T20:32:44.9527640Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9527831Z context = 2025-05-07T20:32:44.9527836Z 2025-05-07T20:32:44.9527999Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9528264Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9528370Z module_map=module_map) 2025-05-07T20:32:44.9528529Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9528631Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9528708Z E ^ 2025-05-07T20:32:44.9529061Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9529068Z 2025-05-07T20:32:44.9529478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9529483Z 2025-05-07T20:32:44.9529583Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9529806Z self=, 2025-05-07T20:32:44.9529883Z T=2048, 2025-05-07T20:32:44.9529962Z D=7168, 2025-05-07T20:32:44.9530043Z scale_ub=None, 2025-05-07T20:32:44.9530170Z contiguous=True, 2025-05-07T20:32:44.9530261Z compiled=True, 2025-05-07T20:32:44.9530335Z ) 2025-05-07T20:32:44.9530549Z self = 2025-05-07T20:32:44.9530719Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.9530724Z 2025-05-07T20:32:44.9530799Z @given( 2025-05-07T20:32:44.9530919Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9531023Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9531141Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9531259Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9531370Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9531444Z ) 2025-05-07T20:32:44.9531688Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9531782Z def test_silu_mul_quant( 2025-05-07T20:32:44.9531858Z self, 2025-05-07T20:32:44.9531935Z T: int, 2025-05-07T20:32:44.9532012Z D: int, 2025-05-07T20:32:44.9532110Z scale_ub: Optional[float], 2025-05-07T20:32:44.9532201Z contiguous: bool, 2025-05-07T20:32:44.9532284Z compiled: bool, 2025-05-07T20:32:44.9532359Z ) -> None: 2025-05-07T20:32:44.9532455Z torch.manual_seed(2025) 2025-05-07T20:32:44.9532528Z 2025-05-07T20:32:44.9532693Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9532813Z 2025-05-07T20:32:44.9532908Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9533034Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9533121Z x = x_sign * x_clamp 2025-05-07T20:32:44.9533200Z x0 = x[:, :D] 2025-05-07T20:32:44.9533281Z x1 = x[:, D:] 2025-05-07T20:32:44.9533354Z 2025-05-07T20:32:44.9533435Z if contiguous: 2025-05-07T20:32:44.9533528Z x0 = x0.contiguous() 2025-05-07T20:32:44.9533620Z x1 = x1.contiguous() 2025-05-07T20:32:44.9533734Z 2025-05-07T20:32:44.9533826Z if scale_ub is not None: 2025-05-07T20:32:44.9533931Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9534065Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9534144Z ) 2025-05-07T20:32:44.9534220Z else: 2025-05-07T20:32:44.9534315Z scale_ub_tensor = None 2025-05-07T20:32:44.9534389Z 2025-05-07T20:32:44.9534559Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9534652Z op = silu_mul_quant 2025-05-07T20:32:44.9534736Z if compiled: 2025-05-07T20:32:44.9534833Z op = torch.compile(op) 2025-05-07T20:32:44.9534940Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9535011Z 2025-05-07T20:32:44.9535099Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9535104Z 2025-05-07T20:32:44.9535203Z moe/activation_test.py:117: 2025-05-07T20:32:44.9535336Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9535440Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9535538Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9535904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9535998Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9536490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9536586Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9536943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9537161Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9537498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9537639Z kernel = self.compile( 2025-05-07T20:32:44.9538015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9538190Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9538313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9538318Z 2025-05-07T20:32:44.9538525Z self = 2025-05-07T20:32:44.9539300Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9539798Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e0c0ea0>} 2025-05-07T20:32:44.9540544Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9540732Z context = 2025-05-07T20:32:44.9540737Z 2025-05-07T20:32:44.9540900Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9541207Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9541314Z module_map=module_map) 2025-05-07T20:32:44.9541478Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9541575Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9541652Z E ^ 2025-05-07T20:32:44.9542006Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9542014Z 2025-05-07T20:32:44.9542464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9542469Z 2025-05-07T20:32:44.9542574Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9542790Z self=, 2025-05-07T20:32:44.9542866Z T=16384, 2025-05-07T20:32:44.9542945Z D=5120, 2025-05-07T20:32:44.9543031Z scale_ub=None, 2025-05-07T20:32:44.9543154Z contiguous=False, 2025-05-07T20:32:44.9543240Z compiled=False, 2025-05-07T20:32:44.9543314Z ) 2025-05-07T20:32:44.9543531Z self = 2025-05-07T20:32:44.9543707Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.9543711Z 2025-05-07T20:32:44.9543789Z @given( 2025-05-07T20:32:44.9543911Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9544012Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9544127Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9544244Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9544354Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9544429Z ) 2025-05-07T20:32:44.9544670Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9544763Z def test_silu_mul_quant( 2025-05-07T20:32:44.9544844Z self, 2025-05-07T20:32:44.9544919Z T: int, 2025-05-07T20:32:44.9544997Z D: int, 2025-05-07T20:32:44.9545097Z scale_ub: Optional[float], 2025-05-07T20:32:44.9545185Z contiguous: bool, 2025-05-07T20:32:44.9545268Z compiled: bool, 2025-05-07T20:32:44.9545347Z ) -> None: 2025-05-07T20:32:44.9545442Z torch.manual_seed(2025) 2025-05-07T20:32:44.9545515Z 2025-05-07T20:32:44.9545687Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9545808Z 2025-05-07T20:32:44.9545899Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9546023Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9547821Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
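The allocator hint named in the message above is an environment variable that the CUDA caching allocator reads when it is first initialized, so it has to be in the process environment before the first CUDA allocation. A minimal sketch, assuming a Python entrypoint; exporting the variable in the shell step that launches pytest would work just as well:

    import os

    # Must be set before torch initializes the CUDA caching allocator,
    # i.e. before the first CUDA allocation in this process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    x = torch.randn(1024, device="cuda")  # allocations now use expandable segments
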
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9547830Z 2025-05-07T20:32:44.9547946Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.9547950Z 2025-05-07T20:32:44.9548049Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9548301Z self=, 2025-05-07T20:32:44.9548399Z T=4096, 2025-05-07T20:32:44.9548477Z D=7168, 2025-05-07T20:32:44.9548563Z scale_ub=1200.0, 2025-05-07T20:32:44.9548645Z contiguous=True, 2025-05-07T20:32:44.9548726Z compiled=True, 2025-05-07T20:32:44.9548798Z ) 2025-05-07T20:32:44.9549013Z self = 2025-05-07T20:32:44.9549230Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9549235Z 2025-05-07T20:32:44.9549310Z @given( 2025-05-07T20:32:44.9549430Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9549528Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9549639Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9549753Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9549866Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9549943Z ) 2025-05-07T20:32:44.9550246Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9550341Z def test_silu_mul_quant( 2025-05-07T20:32:44.9550418Z self, 2025-05-07T20:32:44.9550496Z T: int, 2025-05-07T20:32:44.9550573Z D: int, 2025-05-07T20:32:44.9550669Z scale_ub: Optional[float], 2025-05-07T20:32:44.9550763Z contiguous: bool, 2025-05-07T20:32:44.9550854Z compiled: bool, 2025-05-07T20:32:44.9550972Z ) -> None: 2025-05-07T20:32:44.9551067Z torch.manual_seed(2025) 2025-05-07T20:32:44.9551139Z 2025-05-07T20:32:44.9551303Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9551378Z 2025-05-07T20:32:44.9551469Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9551594Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9553373Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9553386Z 2025-05-07T20:32:44.9553505Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.9553510Z 2025-05-07T20:32:44.9553611Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9553831Z self=, 2025-05-07T20:32:44.9553907Z T=16384, 2025-05-07T20:32:44.9553985Z D=7168, 2025-05-07T20:32:44.9554069Z scale_ub=None, 2025-05-07T20:32:44.9554198Z contiguous=False, 2025-05-07T20:32:44.9554281Z compiled=False, 2025-05-07T20:32:44.9554358Z ) 2025-05-07T20:32:44.9554570Z self = 2025-05-07T20:32:44.9554742Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.9554746Z 2025-05-07T20:32:44.9554826Z @given( 2025-05-07T20:32:44.9554943Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9555043Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9555162Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9555275Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9555390Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9555463Z ) 2025-05-07T20:32:44.9555705Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9555799Z def test_silu_mul_quant( 2025-05-07T20:32:44.9555875Z self, 2025-05-07T20:32:44.9555953Z T: int, 2025-05-07T20:32:44.9556032Z D: int, 2025-05-07T20:32:44.9556129Z scale_ub: Optional[float], 2025-05-07T20:32:44.9556220Z contiguous: bool, 2025-05-07T20:32:44.9556306Z compiled: bool, 2025-05-07T20:32:44.9556383Z ) -> None: 2025-05-07T20:32:44.9556479Z torch.manual_seed(2025) 2025-05-07T20:32:44.9556552Z 2025-05-07T20:32:44.9556715Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9558595Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9558639Z 2025-05-07T20:32:44.9558757Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9558761Z 2025-05-07T20:32:44.9558863Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9559080Z self=, 2025-05-07T20:32:44.9559157Z T=2048, 2025-05-07T20:32:44.9559233Z D=7168, 2025-05-07T20:32:44.9559315Z scale_ub=1200.0, 2025-05-07T20:32:44.9559402Z contiguous=True, 2025-05-07T20:32:44.9559524Z compiled=True, 2025-05-07T20:32:44.9559597Z ) 2025-05-07T20:32:44.9559810Z self = 2025-05-07T20:32:44.9559981Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9559986Z 2025-05-07T20:32:44.9560062Z @given( 2025-05-07T20:32:44.9560180Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9560281Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9560394Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9560513Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9560623Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9560693Z ) 2025-05-07T20:32:44.9560933Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9561026Z def test_silu_mul_quant( 2025-05-07T20:32:44.9561107Z self, 2025-05-07T20:32:44.9561183Z T: int, 2025-05-07T20:32:44.9561256Z D: int, 2025-05-07T20:32:44.9561354Z scale_ub: Optional[float], 2025-05-07T20:32:44.9561442Z contiguous: bool, 2025-05-07T20:32:44.9561526Z compiled: bool, 2025-05-07T20:32:44.9561605Z ) -> None: 2025-05-07T20:32:44.9561698Z torch.manual_seed(2025) 2025-05-07T20:32:44.9561771Z 2025-05-07T20:32:44.9561938Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9562056Z 2025-05-07T20:32:44.9562148Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9562272Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9564038Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9564045Z 2025-05-07T20:32:44.9564162Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.9564166Z 2025-05-07T20:32:44.9564265Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9564487Z self=, 2025-05-07T20:32:44.9564564Z T=2048, 2025-05-07T20:32:44.9564639Z D=7168, 2025-05-07T20:32:44.9564721Z scale_ub=None, 2025-05-07T20:32:44.9564802Z contiguous=True, 2025-05-07T20:32:44.9564884Z compiled=False, 2025-05-07T20:32:44.9564960Z ) 2025-05-07T20:32:44.9565171Z self = 2025-05-07T20:32:44.9565381Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9565393Z 2025-05-07T20:32:44.9565469Z @given( 2025-05-07T20:32:44.9565584Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9565682Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9565794Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9565907Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9566023Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9566100Z ) 2025-05-07T20:32:44.9566383Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9566475Z def test_silu_mul_quant( 2025-05-07T20:32:44.9566552Z self, 2025-05-07T20:32:44.9566628Z T: int, 2025-05-07T20:32:44.9566705Z D: int, 2025-05-07T20:32:44.9566801Z scale_ub: Optional[float], 2025-05-07T20:32:44.9566890Z contiguous: bool, 2025-05-07T20:32:44.9566975Z compiled: bool, 2025-05-07T20:32:44.9567053Z ) -> None: 2025-05-07T20:32:44.9567188Z torch.manual_seed(2025) 2025-05-07T20:32:44.9567262Z 2025-05-07T20:32:44.9567427Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9567502Z 2025-05-07T20:32:44.9567634Z > x_sign = torch.sign(x) 2025-05-07T20:32:44.9569397Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9569412Z 2025-05-07T20:32:44.9569527Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:44.9569535Z 2025-05-07T20:32:44.9569637Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9569856Z self=, 2025-05-07T20:32:44.9569933Z T=1, 2025-05-07T20:32:44.9570012Z D=7168, 2025-05-07T20:32:44.9570096Z scale_ub=1200.0, 2025-05-07T20:32:44.9570179Z contiguous=True, 2025-05-07T20:32:44.9570263Z compiled=False, 2025-05-07T20:32:44.9570335Z ) 2025-05-07T20:32:44.9570592Z self = 2025-05-07T20:32:44.9570759Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9570763Z 2025-05-07T20:32:44.9570840Z @given( 2025-05-07T20:32:44.9570956Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9571054Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9571164Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9571279Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9571395Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9571469Z ) 2025-05-07T20:32:44.9571711Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9571801Z def test_silu_mul_quant( 2025-05-07T20:32:44.9571876Z self, 2025-05-07T20:32:44.9571955Z T: int, 2025-05-07T20:32:44.9572031Z D: int, 2025-05-07T20:32:44.9572127Z scale_ub: Optional[float], 2025-05-07T20:32:44.9572220Z contiguous: bool, 2025-05-07T20:32:44.9572305Z compiled: bool, 2025-05-07T20:32:44.9572381Z ) -> None: 2025-05-07T20:32:44.9572477Z torch.manual_seed(2025) 2025-05-07T20:32:44.9572550Z 2025-05-07T20:32:44.9572715Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9572791Z 2025-05-07T20:32:44.9572881Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9573050Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9573142Z x = x_sign * x_clamp 2025-05-07T20:32:44.9573220Z x0 = x[:, :D] 2025-05-07T20:32:44.9573301Z x1 = x[:, D:] 2025-05-07T20:32:44.9573372Z 2025-05-07T20:32:44.9573454Z if contiguous: 2025-05-07T20:32:44.9573547Z x0 = x0.contiguous() 2025-05-07T20:32:44.9573635Z x1 = x1.contiguous() 2025-05-07T20:32:44.9573708Z 2025-05-07T20:32:44.9573802Z if scale_ub is not None: 2025-05-07T20:32:44.9573909Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9574083Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9574157Z ) 2025-05-07T20:32:44.9574233Z else: 2025-05-07T20:32:44.9574328Z scale_ub_tensor = None 2025-05-07T20:32:44.9574401Z 2025-05-07T20:32:44.9574529Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9574618Z op = silu_mul_quant 2025-05-07T20:32:44.9574706Z if compiled: 2025-05-07T20:32:44.9574844Z op = torch.compile(op) 2025-05-07T20:32:44.9574956Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9575027Z 2025-05-07T20:32:44.9575116Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9575121Z 2025-05-07T20:32:44.9575218Z moe/activation_test.py:117: 2025-05-07T20:32:44.9575345Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9575446Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9575549Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9576046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9576144Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9576498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9576719Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9577061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9577153Z kernel = self.compile( 2025-05-07T20:32:44.9577535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9577705Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9577872Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9577879Z 2025-05-07T20:32:44.9578084Z self = 2025-05-07T20:32:44.9578854Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9579357Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e230680>} 2025-05-07T20:32:44.9580099Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9580288Z context = 2025-05-07T20:32:44.9580299Z 2025-05-07T20:32:44.9580463Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9580722Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9580830Z module_map=module_map) 2025-05-07T20:32:44.9580988Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9581085Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9581231Z E ^ 2025-05-07T20:32:44.9581586Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9581591Z 2025-05-07T20:32:44.9582002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9582007Z 2025-05-07T20:32:44.9582108Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9582326Z self=, 2025-05-07T20:32:44.9582470Z T=128, 2025-05-07T20:32:44.9582545Z D=5120, 2025-05-07T20:32:44.9582626Z scale_ub=None, 2025-05-07T20:32:44.9582713Z contiguous=True, 2025-05-07T20:32:44.9582795Z compiled=False, 2025-05-07T20:32:44.9582867Z ) 2025-05-07T20:32:44.9583086Z self = 2025-05-07T20:32:44.9583253Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9583260Z 2025-05-07T20:32:44.9583381Z @given( 2025-05-07T20:32:44.9583503Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9583601Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9583715Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9583831Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9583940Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9584020Z ) 2025-05-07T20:32:44.9584263Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9584357Z def test_silu_mul_quant( 2025-05-07T20:32:44.9584435Z self, 2025-05-07T20:32:44.9584511Z T: int, 2025-05-07T20:32:44.9584586Z D: int, 2025-05-07T20:32:44.9584683Z scale_ub: Optional[float], 2025-05-07T20:32:44.9584770Z contiguous: bool, 2025-05-07T20:32:44.9584855Z compiled: bool, 2025-05-07T20:32:44.9584933Z ) -> None: 2025-05-07T20:32:44.9585030Z torch.manual_seed(2025) 2025-05-07T20:32:44.9585105Z 2025-05-07T20:32:44.9585270Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9585345Z 2025-05-07T20:32:44.9585439Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9585562Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9585651Z x = x_sign * x_clamp 2025-05-07T20:32:44.9585733Z x0 = x[:, :D] 2025-05-07T20:32:44.9585858Z x1 = x[:, D:] 2025-05-07T20:32:44.9585932Z 2025-05-07T20:32:44.9586021Z if contiguous: 2025-05-07T20:32:44.9586110Z x0 = x0.contiguous() 2025-05-07T20:32:44.9586201Z x1 = x1.contiguous() 2025-05-07T20:32:44.9586273Z 2025-05-07T20:32:44.9586362Z if scale_ub is not None: 2025-05-07T20:32:44.9586468Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9586603Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9586680Z ) 2025-05-07T20:32:44.9586763Z else: 2025-05-07T20:32:44.9586856Z scale_ub_tensor = None 2025-05-07T20:32:44.9586929Z 2025-05-07T20:32:44.9587063Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9587153Z op = silu_mul_quant 2025-05-07T20:32:44.9587236Z if compiled: 2025-05-07T20:32:44.9587338Z op = torch.compile(op) 2025-05-07T20:32:44.9587442Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9587522Z 2025-05-07T20:32:44.9587611Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9587618Z 2025-05-07T20:32:44.9587712Z moe/activation_test.py:117: 2025-05-07T20:32:44.9587842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9587945Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9588044Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9588586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9588683Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9589039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9589256Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9589591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9589692Z kernel = self.compile( 2025-05-07T20:32:44.9590107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9590277Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9590404Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9590408Z 2025-05-07T20:32:44.9590610Z self = 2025-05-07T20:32:44.9591424Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9591925Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e2318a0>} 2025-05-07T20:32:44.9592671Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9592862Z context = 2025-05-07T20:32:44.9592866Z 2025-05-07T20:32:44.9593028Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9593293Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9593398Z module_map=module_map) 2025-05-07T20:32:44.9593558Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9593655Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9593732Z E ^ 2025-05-07T20:32:44.9594083Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9594130Z 2025-05-07T20:32:44.9594542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9594547Z 2025-05-07T20:32:44.9594649Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9594868Z self=, 2025-05-07T20:32:44.9594946Z T=128, 2025-05-07T20:32:44.9595028Z D=7168, 2025-05-07T20:32:44.9595108Z scale_ub=None, 2025-05-07T20:32:44.9595194Z contiguous=True, 2025-05-07T20:32:44.9595279Z compiled=False, 2025-05-07T20:32:44.9595350Z ) 2025-05-07T20:32:44.9595562Z self = 2025-05-07T20:32:44.9595733Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9595737Z 2025-05-07T20:32:44.9595814Z @given( 2025-05-07T20:32:44.9595932Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9596035Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9596150Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9596268Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9596378Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9596452Z ) 2025-05-07T20:32:44.9596695Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9596788Z def test_silu_mul_quant( 2025-05-07T20:32:44.9596909Z self, 2025-05-07T20:32:44.9596993Z T: int, 2025-05-07T20:32:44.9597072Z D: int, 2025-05-07T20:32:44.9597170Z scale_ub: Optional[float], 2025-05-07T20:32:44.9597260Z contiguous: bool, 2025-05-07T20:32:44.9597344Z compiled: bool, 2025-05-07T20:32:44.9597421Z ) -> None: 2025-05-07T20:32:44.9597516Z torch.manual_seed(2025) 2025-05-07T20:32:44.9597588Z 2025-05-07T20:32:44.9597756Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9597831Z 2025-05-07T20:32:44.9597962Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9598089Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9598176Z x = x_sign * x_clamp 2025-05-07T20:32:44.9598255Z x0 = x[:, :D] 2025-05-07T20:32:44.9598337Z x1 = x[:, D:] 2025-05-07T20:32:44.9598407Z 2025-05-07T20:32:44.9598489Z if contiguous: 2025-05-07T20:32:44.9598582Z x0 = x0.contiguous() 2025-05-07T20:32:44.9598711Z x1 = x1.contiguous() 2025-05-07T20:32:44.9598784Z 2025-05-07T20:32:44.9598875Z if scale_ub is not None: 2025-05-07T20:32:44.9598979Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9599114Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9599191Z ) 2025-05-07T20:32:44.9599267Z else: 2025-05-07T20:32:44.9599364Z scale_ub_tensor = None 2025-05-07T20:32:44.9599440Z 2025-05-07T20:32:44.9599566Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9599662Z op = silu_mul_quant 2025-05-07T20:32:44.9599746Z if compiled: 2025-05-07T20:32:44.9599848Z op = torch.compile(op) 2025-05-07T20:32:44.9599951Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9600024Z 2025-05-07T20:32:44.9600116Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9600120Z 2025-05-07T20:32:44.9600215Z moe/activation_test.py:117: 2025-05-07T20:32:44.9600346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9600449Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9600547Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9601043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9601139Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9601539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9601762Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9602098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9602194Z kernel = self.compile( 2025-05-07T20:32:44.9602575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9602747Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9602875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9602880Z 2025-05-07T20:32:44.9603082Z self = 2025-05-07T20:32:44.9603851Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9604356Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e2327a0>} 2025-05-07T20:32:44.9605138Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9605330Z context = 2025-05-07T20:32:44.9605335Z 2025-05-07T20:32:44.9605495Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9605993Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9606101Z module_map=module_map) 2025-05-07T20:32:44.9606264Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9606434Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9606512Z E ^ 2025-05-07T20:32:44.9606862Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9606867Z 2025-05-07T20:32:44.9607282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9607286Z 2025-05-07T20:32:44.9607444Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9607750Z self=, 2025-05-07T20:32:44.9607829Z T=2048, 2025-05-07T20:32:44.9607905Z D=7168, 2025-05-07T20:32:44.9607989Z scale_ub=1200.0, 2025-05-07T20:32:44.9608075Z contiguous=True, 2025-05-07T20:32:44.9608158Z compiled=False, 2025-05-07T20:32:44.9608231Z ) 2025-05-07T20:32:44.9608449Z self = 2025-05-07T20:32:44.9608626Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9608630Z 2025-05-07T20:32:44.9608709Z @given( 2025-05-07T20:32:44.9608825Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9608924Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9609035Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9609151Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9609266Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9609339Z ) 2025-05-07T20:32:44.9609579Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9609674Z def test_silu_mul_quant( 2025-05-07T20:32:44.9609750Z self, 2025-05-07T20:32:44.9609829Z T: int, 2025-05-07T20:32:44.9609904Z D: int, 2025-05-07T20:32:44.9610094Z scale_ub: Optional[float], 2025-05-07T20:32:44.9610184Z contiguous: bool, 2025-05-07T20:32:44.9610273Z compiled: bool, 2025-05-07T20:32:44.9610350Z ) -> None: 2025-05-07T20:32:44.9610448Z torch.manual_seed(2025) 2025-05-07T20:32:44.9610521Z 2025-05-07T20:32:44.9610689Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9612473Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
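Note that the GPU is already nearly exhausted when each example begins (tens of MiB free out of 22.07 GiB), so allocations left over from earlier failed examples appear to be holding the cache. Because Hypothesis replays the decorated test function once per drawn example, a reset placed at the top of the test body, rather than in tearDown, runs before every (T, D, scale_ub, contiguous, compiled) combination. A sketch of such a helper follows; whether it is enough to avoid these OOMs is untested:

    import gc

    import torch

    def _reset_cuda_memory() -> None:
        # Drop dangling references from the previous example, then return the
        # cached blocks to the driver so the next example starts clean.
        gc.collect()
        torch.cuda.empty_cache()

    # Called as the first statement of test_silu_mul_quant, before the
    # torch.randn([T, 2 * D], ...) allocation.
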
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9612481Z 2025-05-07T20:32:44.9612597Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9612604Z 2025-05-07T20:32:44.9612706Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9612923Z self=, 2025-05-07T20:32:44.9613002Z T=1, 2025-05-07T20:32:44.9613075Z D=5120, 2025-05-07T20:32:44.9613157Z scale_ub=1200.0, 2025-05-07T20:32:44.9613245Z contiguous=True, 2025-05-07T20:32:44.9613328Z compiled=False, 2025-05-07T20:32:44.9613400Z ) 2025-05-07T20:32:44.9613680Z self = 2025-05-07T20:32:44.9613844Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9613848Z 2025-05-07T20:32:44.9613926Z @given( 2025-05-07T20:32:44.9614046Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9614142Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9614253Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9614375Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9614529Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9614604Z ) 2025-05-07T20:32:44.9614844Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9614935Z def test_silu_mul_quant( 2025-05-07T20:32:44.9615013Z self, 2025-05-07T20:32:44.9615090Z T: int, 2025-05-07T20:32:44.9615164Z D: int, 2025-05-07T20:32:44.9615304Z scale_ub: Optional[float], 2025-05-07T20:32:44.9615393Z contiguous: bool, 2025-05-07T20:32:44.9615480Z compiled: bool, 2025-05-07T20:32:44.9615561Z ) -> None: 2025-05-07T20:32:44.9615654Z torch.manual_seed(2025) 2025-05-07T20:32:44.9615730Z 2025-05-07T20:32:44.9615900Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9615974Z 2025-05-07T20:32:44.9616067Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9616192Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9616284Z x = x_sign * x_clamp 2025-05-07T20:32:44.9616366Z x0 = x[:, :D] 2025-05-07T20:32:44.9616445Z x1 = x[:, D:] 2025-05-07T20:32:44.9616519Z 2025-05-07T20:32:44.9616604Z if contiguous: 2025-05-07T20:32:44.9616693Z x0 = x0.contiguous() 2025-05-07T20:32:44.9616783Z x1 = x1.contiguous() 2025-05-07T20:32:44.9616855Z 2025-05-07T20:32:44.9616947Z if scale_ub is not None: 2025-05-07T20:32:44.9617053Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9617190Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9617266Z ) 2025-05-07T20:32:44.9617344Z else: 2025-05-07T20:32:44.9617436Z scale_ub_tensor = None 2025-05-07T20:32:44.9617508Z 2025-05-07T20:32:44.9617640Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9617729Z op = silu_mul_quant 2025-05-07T20:32:44.9617858Z if compiled: 2025-05-07T20:32:44.9617963Z op = torch.compile(op) 2025-05-07T20:32:44.9618067Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9618139Z 2025-05-07T20:32:44.9618231Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9618235Z 2025-05-07T20:32:44.9618332Z moe/activation_test.py:117: 2025-05-07T20:32:44.9618461Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9618563Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9618663Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9619163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9619259Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9619613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9619840Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9620178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9620271Z kernel = self.compile( 2025-05-07T20:32:44.9620648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9620862Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9620994Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9620998Z 2025-05-07T20:32:44.9621198Z self = 2025-05-07T20:32:44.9621969Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9622470Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e233b00>} 2025-05-07T20:32:44.9623250Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9623479Z context = 2025-05-07T20:32:44.9623485Z 2025-05-07T20:32:44.9623647Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9623909Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9624015Z module_map=module_map) 2025-05-07T20:32:44.9624174Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9624278Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9624354Z E ^ 2025-05-07T20:32:44.9624712Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9624719Z 2025-05-07T20:32:44.9625127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9625131Z 2025-05-07T20:32:44.9625231Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9625455Z self=, 2025-05-07T20:32:44.9625531Z T=2048, 2025-05-07T20:32:44.9625608Z D=5120, 2025-05-07T20:32:44.9625691Z scale_ub=None, 2025-05-07T20:32:44.9625775Z contiguous=True, 2025-05-07T20:32:44.9625861Z compiled=False, 2025-05-07T20:32:44.9625938Z ) 2025-05-07T20:32:44.9626150Z self = 2025-05-07T20:32:44.9626323Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9626374Z 2025-05-07T20:32:44.9626450Z @given( 2025-05-07T20:32:44.9626566Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9626665Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9626777Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9626891Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9627006Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9627083Z ) 2025-05-07T20:32:44.9627327Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9627422Z def test_silu_mul_quant( 2025-05-07T20:32:44.9627497Z self, 2025-05-07T20:32:44.9627574Z T: int, 2025-05-07T20:32:44.9627650Z D: int, 2025-05-07T20:32:44.9627747Z scale_ub: Optional[float], 2025-05-07T20:32:44.9627836Z contiguous: bool, 2025-05-07T20:32:44.9627919Z compiled: bool, 2025-05-07T20:32:44.9627998Z ) -> None: 2025-05-07T20:32:44.9628097Z torch.manual_seed(2025) 2025-05-07T20:32:44.9628184Z 2025-05-07T20:32:44.9628375Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9628461Z 2025-05-07T20:32:44.9628552Z > x_sign = torch.sign(x) 2025-05-07T20:32:44.9630368Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9630374Z 2025-05-07T20:32:44.9630491Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:44.9630498Z 2025-05-07T20:32:44.9630642Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9630864Z self=, 2025-05-07T20:32:44.9634338Z T=16384, 2025-05-07T20:32:44.9634432Z D=5120, 2025-05-07T20:32:44.9634519Z scale_ub=None, 2025-05-07T20:32:44.9634606Z contiguous=True, 2025-05-07T20:32:44.9634695Z compiled=False, 2025-05-07T20:32:44.9634771Z ) 2025-05-07T20:32:44.9635063Z self = 2025-05-07T20:32:44.9635247Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9635251Z 2025-05-07T20:32:44.9635329Z @given( 2025-05-07T20:32:44.9635452Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9635551Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9635666Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9635796Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9635914Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9635989Z ) 2025-05-07T20:32:44.9636239Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9636334Z def test_silu_mul_quant( 2025-05-07T20:32:44.9636415Z self, 2025-05-07T20:32:44.9636494Z T: int, 2025-05-07T20:32:44.9636572Z D: int, 2025-05-07T20:32:44.9636677Z scale_ub: Optional[float], 2025-05-07T20:32:44.9636770Z contiguous: bool, 2025-05-07T20:32:44.9636857Z compiled: bool, 2025-05-07T20:32:44.9636941Z ) -> None: 2025-05-07T20:32:44.9637036Z torch.manual_seed(2025) 2025-05-07T20:32:44.9637110Z 2025-05-07T20:32:44.9637281Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9639118Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9639176Z 2025-05-07T20:32:44.9639303Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9639308Z 2025-05-07T20:32:44.9639412Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9639636Z self=, 2025-05-07T20:32:44.9639714Z T=4096, 2025-05-07T20:32:44.9639792Z D=5120, 2025-05-07T20:32:44.9639881Z scale_ub=None, 2025-05-07T20:32:44.9639967Z contiguous=True, 2025-05-07T20:32:44.9640051Z compiled=False, 2025-05-07T20:32:44.9640130Z ) 2025-05-07T20:32:44.9640347Z self = 2025-05-07T20:32:44.9640520Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9640524Z 2025-05-07T20:32:44.9640604Z @given( 2025-05-07T20:32:44.9640723Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9640822Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9640940Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9641104Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9641223Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9641297Z ) 2025-05-07T20:32:44.9641541Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9641639Z def test_silu_mul_quant( 2025-05-07T20:32:44.9641716Z self, 2025-05-07T20:32:44.9641793Z T: int, 2025-05-07T20:32:44.9641879Z D: int, 2025-05-07T20:32:44.9641979Z scale_ub: Optional[float], 2025-05-07T20:32:44.9642111Z contiguous: bool, 2025-05-07T20:32:44.9642200Z compiled: bool, 2025-05-07T20:32:44.9642279Z ) -> None: 2025-05-07T20:32:44.9642374Z torch.manual_seed(2025) 2025-05-07T20:32:44.9642452Z 2025-05-07T20:32:44.9642619Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9644451Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9644462Z 2025-05-07T20:32:44.9644585Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9644592Z 2025-05-07T20:32:44.9644695Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9644917Z self=, 2025-05-07T20:32:44.9644998Z T=2048, 2025-05-07T20:32:44.9645075Z D=5120, 2025-05-07T20:32:44.9645158Z scale_ub=None, 2025-05-07T20:32:44.9645247Z contiguous=False, 2025-05-07T20:32:44.9645334Z compiled=False, 2025-05-07T20:32:44.9645408Z ) 2025-05-07T20:32:44.9645626Z self = 2025-05-07T20:32:44.9645798Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.9645802Z 2025-05-07T20:32:44.9645884Z @given( 2025-05-07T20:32:44.9646002Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9646101Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9646264Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9646385Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9646497Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9646575Z ) 2025-05-07T20:32:44.9646818Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9646913Z def test_silu_mul_quant( 2025-05-07T20:32:44.9646992Z self, 2025-05-07T20:32:44.9647071Z T: int, 2025-05-07T20:32:44.9647154Z D: int, 2025-05-07T20:32:44.9647255Z scale_ub: Optional[float], 2025-05-07T20:32:44.9647345Z contiguous: bool, 2025-05-07T20:32:44.9647434Z compiled: bool, 2025-05-07T20:32:44.9647514Z ) -> None: 2025-05-07T20:32:44.9647687Z torch.manual_seed(2025) 2025-05-07T20:32:44.9647764Z 2025-05-07T20:32:44.9647931Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9649743Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9649760Z 2025-05-07T20:32:44.9649879Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9649883Z 2025-05-07T20:32:44.9649986Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9650208Z self=, 2025-05-07T20:32:44.9650285Z T=4096, 2025-05-07T20:32:44.9650367Z D=7168, 2025-05-07T20:32:44.9650450Z scale_ub=None, 2025-05-07T20:32:44.9650539Z contiguous=True, 2025-05-07T20:32:44.9650627Z compiled=True, 2025-05-07T20:32:44.9650746Z ) 2025-05-07T20:32:44.9650961Z self = 2025-05-07T20:32:44.9651132Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.9651136Z 2025-05-07T20:32:44.9651214Z @given( 2025-05-07T20:32:44.9651332Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9651434Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9651589Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9651710Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9651823Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9651898Z ) 2025-05-07T20:32:44.9652144Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9652238Z def test_silu_mul_quant( 2025-05-07T20:32:44.9652315Z self, 2025-05-07T20:32:44.9652399Z T: int, 2025-05-07T20:32:44.9652476Z D: int, 2025-05-07T20:32:44.9652578Z scale_ub: Optional[float], 2025-05-07T20:32:44.9652670Z contiguous: bool, 2025-05-07T20:32:44.9652757Z compiled: bool, 2025-05-07T20:32:44.9652836Z ) -> None: 2025-05-07T20:32:44.9652937Z torch.manual_seed(2025) 2025-05-07T20:32:44.9653011Z 2025-05-07T20:32:44.9653184Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9654953Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9655004Z 2025-05-07T20:32:44.9655127Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9655132Z 2025-05-07T20:32:44.9655234Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9655454Z self=, 2025-05-07T20:32:44.9655535Z T=2048, 2025-05-07T20:32:44.9655612Z D=5120, 2025-05-07T20:32:44.9655696Z scale_ub=1200.0, 2025-05-07T20:32:44.9655788Z contiguous=False, 2025-05-07T20:32:44.9655875Z compiled=False, 2025-05-07T20:32:44.9655950Z ) 2025-05-07T20:32:44.9656166Z self = 2025-05-07T20:32:44.9656339Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.9656344Z 2025-05-07T20:32:44.9656424Z @given( 2025-05-07T20:32:44.9656542Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9656644Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9656763Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9656879Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9656992Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9657069Z ) 2025-05-07T20:32:44.9657310Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9657404Z def test_silu_mul_quant( 2025-05-07T20:32:44.9657527Z self, 2025-05-07T20:32:44.9657609Z T: int, 2025-05-07T20:32:44.9657689Z D: int, 2025-05-07T20:32:44.9657788Z scale_ub: Optional[float], 2025-05-07T20:32:44.9657877Z contiguous: bool, 2025-05-07T20:32:44.9657965Z compiled: bool, 2025-05-07T20:32:44.9658045Z ) -> None: 2025-05-07T20:32:44.9658142Z torch.manual_seed(2025) 2025-05-07T20:32:44.9658218Z 2025-05-07T20:32:44.9658385Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9660193Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9660235Z 2025-05-07T20:32:44.9660355Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9660360Z 2025-05-07T20:32:44.9660463Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9660685Z self=, 2025-05-07T20:32:44.9660764Z T=4096, 2025-05-07T20:32:44.9660845Z D=7168, 2025-05-07T20:32:44.9660932Z scale_ub=1200.0, 2025-05-07T20:32:44.9661021Z contiguous=True, 2025-05-07T20:32:44.9661108Z compiled=False, 2025-05-07T20:32:44.9661184Z ) 2025-05-07T20:32:44.9661397Z self = 2025-05-07T20:32:44.9661569Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9661574Z 2025-05-07T20:32:44.9661653Z @given( 2025-05-07T20:32:44.9661774Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9661878Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9661992Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9662113Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9662226Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9662301Z ) 2025-05-07T20:32:44.9662545Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9662684Z def test_silu_mul_quant( 2025-05-07T20:32:44.9662761Z self, 2025-05-07T20:32:44.9662844Z T: int, 2025-05-07T20:32:44.9662921Z D: int, 2025-05-07T20:32:44.9663020Z scale_ub: Optional[float], 2025-05-07T20:32:44.9663114Z contiguous: bool, 2025-05-07T20:32:44.9663200Z compiled: bool, 2025-05-07T20:32:44.9663279Z ) -> None: 2025-05-07T20:32:44.9663376Z torch.manual_seed(2025) 2025-05-07T20:32:44.9663452Z 2025-05-07T20:32:44.9663627Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9665390Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9665401Z 2025-05-07T20:32:44.9665522Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9665527Z 2025-05-07T20:32:44.9665630Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9665850Z self=, 2025-05-07T20:32:44.9665932Z T=16384, 2025-05-07T20:32:44.9666054Z D=7168, 2025-05-07T20:32:44.9666142Z scale_ub=None, 2025-05-07T20:32:44.9666232Z contiguous=False, 2025-05-07T20:32:44.9666316Z compiled=True, 2025-05-07T20:32:44.9666390Z ) 2025-05-07T20:32:44.9666611Z self = 2025-05-07T20:32:44.9666785Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.9666790Z 2025-05-07T20:32:44.9666871Z @given( 2025-05-07T20:32:44.9666992Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9667131Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9667248Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9667364Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9667476Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9667554Z ) 2025-05-07T20:32:44.9667796Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9667931Z def test_silu_mul_quant( 2025-05-07T20:32:44.9668014Z self, 2025-05-07T20:32:44.9668092Z T: int, 2025-05-07T20:32:44.9668191Z D: int, 2025-05-07T20:32:44.9668298Z scale_ub: Optional[float], 2025-05-07T20:32:44.9668410Z contiguous: bool, 2025-05-07T20:32:44.9668499Z compiled: bool, 2025-05-07T20:32:44.9668578Z ) -> None: 2025-05-07T20:32:44.9668672Z torch.manual_seed(2025) 2025-05-07T20:32:44.9668752Z 2025-05-07T20:32:44.9668918Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9670698Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9670704Z 2025-05-07T20:32:44.9670822Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9670826Z 2025-05-07T20:32:44.9670928Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9671150Z self=, 2025-05-07T20:32:44.9671271Z T=4096, 2025-05-07T20:32:44.9671352Z D=7168, 2025-05-07T20:32:44.9671437Z scale_ub=None, 2025-05-07T20:32:44.9671523Z contiguous=True, 2025-05-07T20:32:44.9671614Z compiled=False, 2025-05-07T20:32:44.9671689Z ) 2025-05-07T20:32:44.9671903Z self = 2025-05-07T20:32:44.9672075Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9672080Z 2025-05-07T20:32:44.9672157Z @given( 2025-05-07T20:32:44.9672282Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9672383Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9672497Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9672618Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9672730Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9672805Z ) 2025-05-07T20:32:44.9673049Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9673149Z def test_silu_mul_quant( 2025-05-07T20:32:44.9673226Z self, 2025-05-07T20:32:44.9673307Z T: int, 2025-05-07T20:32:44.9673385Z D: int, 2025-05-07T20:32:44.9673483Z scale_ub: Optional[float], 2025-05-07T20:32:44.9673575Z contiguous: bool, 2025-05-07T20:32:44.9673662Z compiled: bool, 2025-05-07T20:32:44.9673741Z ) -> None: 2025-05-07T20:32:44.9673839Z torch.manual_seed(2025) 2025-05-07T20:32:44.9673958Z 2025-05-07T20:32:44.9674130Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9675896Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9675965Z 2025-05-07T20:32:44.9676086Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9676091Z 2025-05-07T20:32:44.9676193Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9676417Z self=, 2025-05-07T20:32:44.9676534Z T=16384, 2025-05-07T20:32:44.9676616Z D=7168, 2025-05-07T20:32:44.9676700Z scale_ub=None, 2025-05-07T20:32:44.9676793Z contiguous=True, 2025-05-07T20:32:44.9676877Z compiled=False, 2025-05-07T20:32:44.9676951Z ) 2025-05-07T20:32:44.9677166Z self = 2025-05-07T20:32:44.9677338Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9677346Z 2025-05-07T20:32:44.9677429Z @given( 2025-05-07T20:32:44.9677549Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9677651Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9677766Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9677883Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9677999Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9678073Z ) 2025-05-07T20:32:44.9678342Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9678449Z def test_silu_mul_quant( 2025-05-07T20:32:44.9678545Z self, 2025-05-07T20:32:44.9678623Z T: int, 2025-05-07T20:32:44.9678705Z D: int, 2025-05-07T20:32:44.9678808Z scale_ub: Optional[float], 2025-05-07T20:32:44.9678900Z contiguous: bool, 2025-05-07T20:32:44.9678987Z compiled: bool, 2025-05-07T20:32:44.9679066Z ) -> None: 2025-05-07T20:32:44.9679210Z torch.manual_seed(2025) 2025-05-07T20:32:44.9679287Z 2025-05-07T20:32:44.9679452Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9681223Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
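Free memory is pinned at 26.44 MiB across consecutive examples, which points at memory retained from earlier examples in the same process rather than at any one allocation. A hypothetical per-example teardown (the helper name is illustrative) that returns whatever is no longer referenced:

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Drop unreferenced Python objects first so their CUDA blocks become
        # cache-resident, then hand the cached blocks back to the driver.
        gc.collect()
        torch.cuda.synchronize()
        torch.cuda.empty_cache()

The limitation: empty_cache() can only release blocks that are cached but unused; the 21.73 GiB still referenced by live tensors stays allocated regardless.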
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9681229Z 2025-05-07T20:32:44.9681347Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9681351Z 2025-05-07T20:32:44.9681457Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9681681Z self=, 2025-05-07T20:32:44.9681762Z T=16384, 2025-05-07T20:32:44.9681843Z D=7168, 2025-05-07T20:32:44.9681927Z scale_ub=1200.0, 2025-05-07T20:32:44.9682014Z contiguous=True, 2025-05-07T20:32:44.9682099Z compiled=False, 2025-05-07T20:32:44.9682173Z ) 2025-05-07T20:32:44.9682388Z self = 2025-05-07T20:32:44.9682607Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9682614Z 2025-05-07T20:32:44.9682694Z @given( 2025-05-07T20:32:44.9682814Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9682913Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9683027Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9683146Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9683259Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9683340Z ) 2025-05-07T20:32:44.9683582Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9683717Z def test_silu_mul_quant( 2025-05-07T20:32:44.9683796Z self, 2025-05-07T20:32:44.9683874Z T: int, 2025-05-07T20:32:44.9683951Z D: int, 2025-05-07T20:32:44.9684052Z scale_ub: Optional[float], 2025-05-07T20:32:44.9684141Z contiguous: bool, 2025-05-07T20:32:44.9684228Z compiled: bool, 2025-05-07T20:32:44.9684314Z ) -> None: 2025-05-07T20:32:44.9684447Z torch.manual_seed(2025) 2025-05-07T20:32:44.9684526Z 2025-05-07T20:32:44.9684693Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9686459Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9686470Z 2025-05-07T20:32:44.9686591Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9686596Z 2025-05-07T20:32:44.9686698Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9686924Z self=, 2025-05-07T20:32:44.9687003Z T=128, 2025-05-07T20:32:44.9687081Z D=5120, 2025-05-07T20:32:44.9687168Z scale_ub=1200.0, 2025-05-07T20:32:44.9687255Z contiguous=False, 2025-05-07T20:32:44.9687339Z compiled=False, 2025-05-07T20:32:44.9687416Z ) 2025-05-07T20:32:44.9687675Z self = 2025-05-07T20:32:44.9687892Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.9687901Z 2025-05-07T20:32:44.9687979Z @given( 2025-05-07T20:32:44.9688098Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9688199Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9688312Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9688429Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9688547Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9688625Z ) 2025-05-07T20:32:44.9688867Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9688964Z def test_silu_mul_quant( 2025-05-07T20:32:44.9689040Z self, 2025-05-07T20:32:44.9689117Z T: int, 2025-05-07T20:32:44.9689196Z D: int, 2025-05-07T20:32:44.9689295Z scale_ub: Optional[float], 2025-05-07T20:32:44.9689387Z contiguous: bool, 2025-05-07T20:32:44.9689476Z compiled: bool, 2025-05-07T20:32:44.9689555Z ) -> None: 2025-05-07T20:32:44.9689655Z torch.manual_seed(2025) 2025-05-07T20:32:44.9689730Z 2025-05-07T20:32:44.9689897Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9689974Z 2025-05-07T20:32:44.9690067Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9690192Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9690286Z x = x_sign * x_clamp 2025-05-07T20:32:44.9690413Z x0 = x[:, :D] 2025-05-07T20:32:44.9690499Z x1 = x[:, D:] 2025-05-07T20:32:44.9690577Z 2025-05-07T20:32:44.9690661Z if contiguous: 2025-05-07T20:32:44.9690760Z x0 = x0.contiguous() 2025-05-07T20:32:44.9690850Z x1 = x1.contiguous() 2025-05-07T20:32:44.9690923Z 2025-05-07T20:32:44.9691017Z if scale_ub is not None: 2025-05-07T20:32:44.9691124Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9691263Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9691384Z ) 2025-05-07T20:32:44.9691462Z else: 2025-05-07T20:32:44.9691557Z scale_ub_tensor = None 2025-05-07T20:32:44.9691634Z 2025-05-07T20:32:44.9691765Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9691857Z op = silu_mul_quant 2025-05-07T20:32:44.9691949Z if compiled: 2025-05-07T20:32:44.9692050Z op = torch.compile(op) 2025-05-07T20:32:44.9692199Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9692276Z 2025-05-07T20:32:44.9692369Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9692374Z 2025-05-07T20:32:44.9692474Z moe/activation_test.py:117: 2025-05-07T20:32:44.9692604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9692706Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9692809Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9693317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9693420Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:44.9693780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:44.9694001Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:44.9694345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:44.9694441Z kernel = self.compile(
2025-05-07T20:32:44.9694821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:44.9694998Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:44.9695125Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:44.9695171Z 
2025-05-07T20:32:44.9695376Z self = 
2025-05-07T20:32:44.9696152Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:44.9696658Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5999d5a700>}
2025-05-07T20:32:44.9697406Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:44.9697595Z context = 
2025-05-07T20:32:44.9697599Z 
2025-05-07T20:32:44.9697766Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:44.9698033Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:44.9698141Z module_map=module_map)
2025-05-07T20:32:44.9698306Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.9698405Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:44.9698486Z E ^
2025-05-07T20:32:44.9698887Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9698892Z 2025-05-07T20:32:44.9699304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9699309Z 2025-05-07T20:32:44.9699416Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9699636Z self=, 2025-05-07T20:32:44.9699719Z T=2048, 2025-05-07T20:32:44.9699800Z D=7168, 2025-05-07T20:32:44.9699884Z scale_ub=None, 2025-05-07T20:32:44.9700014Z contiguous=False, 2025-05-07T20:32:44.9700098Z compiled=False, 2025-05-07T20:32:44.9700172Z ) 2025-05-07T20:32:44.9700391Z self = 2025-05-07T20:32:44.9700562Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.9700567Z 2025-05-07T20:32:44.9700646Z @given( 2025-05-07T20:32:44.9700809Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9700910Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9701024Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9701143Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9701256Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9701333Z ) 2025-05-07T20:32:44.9701576Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9701673Z def test_silu_mul_quant( 2025-05-07T20:32:44.9701755Z self, 2025-05-07T20:32:44.9701832Z T: int, 2025-05-07T20:32:44.9701910Z D: int, 2025-05-07T20:32:44.9702011Z scale_ub: Optional[float], 2025-05-07T20:32:44.9702102Z contiguous: bool, 2025-05-07T20:32:44.9702189Z compiled: bool, 2025-05-07T20:32:44.9702271Z ) -> None: 2025-05-07T20:32:44.9702367Z torch.manual_seed(2025) 2025-05-07T20:32:44.9702441Z 2025-05-07T20:32:44.9702619Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9704388Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
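The CompilationError above is a hardware capability issue rather than a flaky failure: this job runs on linux.g5.4xlarge.nvidia.gpu, i.e. an NVIDIA A10G (compute capability 8.6), and Triton rejects the fp8e4nv (FP8 E4M3) dtype there, accepting only 'fp8e4b15' and 'fp8e5' per the error. A hypothetical guard, under the assumption that Triton's fp8e4nv lowering requires compute capability 8.9+ (Ada/Hopper), which is consistent with the A10G failing:

    import unittest

    import torch

    def cuda_supports_fp8_e4m3() -> bool:
        # Assumption: fp8e4nv compiles only on SM 8.9+ GPUs;
        # the A10G in this log is SM 8.6, hence the ValueError.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Applied to the test, e.g.:
    # @unittest.skipIf(not cuda_supports_fp8_e4m3(), "fp8e4nv needs SM 8.9+")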
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9704443Z 2025-05-07T20:32:44.9704561Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9704566Z 2025-05-07T20:32:44.9704669Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9704893Z self=, 2025-05-07T20:32:44.9704974Z T=128, 2025-05-07T20:32:44.9705054Z D=7168, 2025-05-07T20:32:44.9705142Z scale_ub=1200.0, 2025-05-07T20:32:44.9705228Z contiguous=True, 2025-05-07T20:32:44.9705313Z compiled=True, 2025-05-07T20:32:44.9705390Z ) 2025-05-07T20:32:44.9705880Z self = 2025-05-07T20:32:44.9706053Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9706060Z 2025-05-07T20:32:44.9706176Z @given( 2025-05-07T20:32:44.9706340Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9706487Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9706639Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9706758Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9706873Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9706948Z ) 2025-05-07T20:32:44.9707317Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9707421Z def test_silu_mul_quant( 2025-05-07T20:32:44.9707502Z self, 2025-05-07T20:32:44.9707582Z T: int, 2025-05-07T20:32:44.9707661Z D: int, 2025-05-07T20:32:44.9707760Z scale_ub: Optional[float], 2025-05-07T20:32:44.9707853Z contiguous: bool, 2025-05-07T20:32:44.9707942Z compiled: bool, 2025-05-07T20:32:44.9708022Z ) -> None: 2025-05-07T20:32:44.9708121Z torch.manual_seed(2025) 2025-05-07T20:32:44.9708199Z 2025-05-07T20:32:44.9708430Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9708509Z 2025-05-07T20:32:44.9708602Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9708730Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9708823Z x = x_sign * x_clamp 2025-05-07T20:32:44.9708905Z x0 = x[:, :D] 2025-05-07T20:32:44.9708988Z x1 = x[:, D:] 2025-05-07T20:32:44.9709065Z 2025-05-07T20:32:44.9709154Z if contiguous: 2025-05-07T20:32:44.9709309Z x0 = x0.contiguous() 2025-05-07T20:32:44.9709402Z x1 = x1.contiguous() 2025-05-07T20:32:44.9709477Z 2025-05-07T20:32:44.9709573Z if scale_ub is not None: 2025-05-07T20:32:44.9709680Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9709817Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9709898Z ) 2025-05-07T20:32:44.9709979Z else: 2025-05-07T20:32:44.9710074Z scale_ub_tensor = None 2025-05-07T20:32:44.9710153Z 2025-05-07T20:32:44.9710283Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9710377Z op = silu_mul_quant 2025-05-07T20:32:44.9710468Z if compiled: 2025-05-07T20:32:44.9710570Z op = torch.compile(op) 2025-05-07T20:32:44.9710679Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9710755Z 2025-05-07T20:32:44.9710850Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9710857Z 2025-05-07T20:32:44.9710961Z moe/activation_test.py:117: 2025-05-07T20:32:44.9711092Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9711194Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9711297Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9711667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9711829Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9712325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:44.9712426Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:44.9712784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:44.9713009Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:44.9713352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:44.9713453Z kernel = self.compile(
2025-05-07T20:32:44.9713832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:44.9714013Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:44.9714142Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:44.9714151Z 
2025-05-07T20:32:44.9714355Z self = 
2025-05-07T20:32:44.9715132Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:44.9715676Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5999d5bf60>}
2025-05-07T20:32:44.9716427Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:44.9716615Z context = 
2025-05-07T20:32:44.9716623Z 
2025-05-07T20:32:44.9716792Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:44.9717098Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:44.9717210Z module_map=module_map)
2025-05-07T20:32:44.9717376Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.9717477Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:44.9717557Z E ^
2025-05-07T20:32:44.9717953Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9717958Z 2025-05-07T20:32:44.9718396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9718400Z 2025-05-07T20:32:44.9718533Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9718753Z self=, 2025-05-07T20:32:44.9718836Z T=128, 2025-05-07T20:32:44.9718921Z D=7168, 2025-05-07T20:32:44.9719007Z scale_ub=1200.0, 2025-05-07T20:32:44.9719094Z contiguous=True, 2025-05-07T20:32:44.9719185Z compiled=False, 2025-05-07T20:32:44.9719262Z ) 2025-05-07T20:32:44.9719477Z self = 2025-05-07T20:32:44.9719651Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9719655Z 2025-05-07T20:32:44.9719736Z @given( 2025-05-07T20:32:44.9719860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9719962Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9720077Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9720197Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9720311Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9720388Z ) 2025-05-07T20:32:44.9720681Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9720779Z def test_silu_mul_quant( 2025-05-07T20:32:44.9720856Z self, 2025-05-07T20:32:44.9720938Z T: int, 2025-05-07T20:32:44.9721017Z D: int, 2025-05-07T20:32:44.9721115Z scale_ub: Optional[float], 2025-05-07T20:32:44.9721208Z contiguous: bool, 2025-05-07T20:32:44.9721295Z compiled: bool, 2025-05-07T20:32:44.9721377Z ) -> None: 2025-05-07T20:32:44.9721474Z torch.manual_seed(2025) 2025-05-07T20:32:44.9721551Z 2025-05-07T20:32:44.9721725Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9721799Z 2025-05-07T20:32:44.9721891Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9722022Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9723790Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
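By this example only 4.44 MiB of the 22.07 GiB is free, so even a 20 MiB request fails. One hypothetical way to let the property test reject such examples instead of erroring (the helper name is illustrative; torch.cuda.mem_get_info() returns free and total bytes for the current device):

    import torch
    from hypothesis import assume

    def require_free_cuda_mib(required_mib: float) -> None:
        free_bytes, _total_bytes = torch.cuda.mem_get_info()
        # assume() makes Hypothesis discard the example rather than fail it.
        assume(free_bytes >= required_mib * 2**20)

This masks the retention problem rather than fixing it, but it separates "example too big for this GPU right now" from real regressions.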
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9723802Z 2025-05-07T20:32:44.9723965Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.9723972Z 2025-05-07T20:32:44.9724077Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9724300Z self=, 2025-05-07T20:32:44.9724385Z T=128, 2025-05-07T20:32:44.9724462Z D=5120, 2025-05-07T20:32:44.9724550Z scale_ub=1200.0, 2025-05-07T20:32:44.9724637Z contiguous=True, 2025-05-07T20:32:44.9724720Z compiled=True, 2025-05-07T20:32:44.9724798Z ) 2025-05-07T20:32:44.9725018Z self = 2025-05-07T20:32:44.9725226Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9725230Z 2025-05-07T20:32:44.9725313Z @given( 2025-05-07T20:32:44.9725431Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9725529Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9725647Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9725805Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9725925Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9726000Z ) 2025-05-07T20:32:44.9726243Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9726340Z def test_silu_mul_quant( 2025-05-07T20:32:44.9726417Z self, 2025-05-07T20:32:44.9726495Z T: int, 2025-05-07T20:32:44.9726574Z D: int, 2025-05-07T20:32:44.9726675Z scale_ub: Optional[float], 2025-05-07T20:32:44.9726768Z contiguous: bool, 2025-05-07T20:32:44.9726859Z compiled: bool, 2025-05-07T20:32:44.9726940Z ) -> None: 2025-05-07T20:32:44.9727035Z torch.manual_seed(2025) 2025-05-07T20:32:44.9727113Z 2025-05-07T20:32:44.9727281Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9727359Z 2025-05-07T20:32:44.9727452Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9727651Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9729414Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
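Unlike the earlier failures, this one gets past torch.randn and dies on the clamp at activation_test.py:95: the preprocessing allocates three temporaries (the sign, abs, and clamp results) on top of x. A lower-peak-memory rewrite of that step is possible with in-place ops; a sketch, not the test's actual code:

    import torch

    def clamp_signed_inplace(x: torch.Tensor) -> torch.Tensor:
        # Equivalent to torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0),
        # but abs/clamp/mul reuse x's storage (mutating the caller's tensor),
        # so only the sign tensor is newly allocated instead of three temporaries.
        x_sign = torch.sign(x)
        return x.abs_().clamp_(0.01, 2.0).mul_(x_sign)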
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9729468Z 2025-05-07T20:32:44.9729590Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.9729595Z 2025-05-07T20:32:44.9729697Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9729917Z self=, 2025-05-07T20:32:44.9730003Z T=128, 2025-05-07T20:32:44.9730080Z D=7168, 2025-05-07T20:32:44.9730168Z scale_ub=None, 2025-05-07T20:32:44.9730257Z contiguous=True, 2025-05-07T20:32:44.9730340Z compiled=True, 2025-05-07T20:32:44.9730414Z ) 2025-05-07T20:32:44.9730631Z self = 2025-05-07T20:32:44.9730797Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.9730802Z 2025-05-07T20:32:44.9730882Z @given( 2025-05-07T20:32:44.9731003Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9731106Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9731224Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9731340Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9731453Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9731532Z ) 2025-05-07T20:32:44.9731776Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9731918Z def test_silu_mul_quant( 2025-05-07T20:32:44.9732000Z self, 2025-05-07T20:32:44.9732079Z T: int, 2025-05-07T20:32:44.9732160Z D: int, 2025-05-07T20:32:44.9732258Z scale_ub: Optional[float], 2025-05-07T20:32:44.9732348Z contiguous: bool, 2025-05-07T20:32:44.9732436Z compiled: bool, 2025-05-07T20:32:44.9732516Z ) -> None: 2025-05-07T20:32:44.9732610Z torch.manual_seed(2025) 2025-05-07T20:32:44.9732688Z 2025-05-07T20:32:44.9732858Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9734703Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9734709Z 2025-05-07T20:32:44.9734829Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9734965Z =============================== warnings summary =============================== 2025-05-07T20:32:44.9735277Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:44.9735581Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:44.9735883Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:44.9736761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:44.9736988Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:44.9736996Z 2025-05-07T20:32:44.9737203Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:44.9737366Z ================= 1 failed, 1 deselected, 3 warnings in 13.74s ================= 2025-05-07T20:32:46.5363991Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:46.5991720Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:32:46.5992181Z 2025-05-07T20:32:46.5992518Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:32:46.5993629Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:32:46.5994412Z 2025-05-07T20:32:46.5994431Z 2025-05-07T20:32:46.5994447Z 2025-05-07T20:32:46.6009962Z ##[error]Process completed with exit code 1. 2025-05-07T20:32:46.6097080Z Post job cleanup. 2025-05-07T20:32:46.7079923Z [command]/usr/bin/git version 2025-05-07T20:32:46.7120253Z git version 2.47.1 2025-05-07T20:32:46.7156137Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/a3f6b24d-5dfa-43e1-a56f-95622c529176/.gitconfig' 2025-05-07T20:32:46.7166963Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/a3f6b24d-5dfa-43e1-a56f-95622c529176' before making global git config changes 2025-05-07T20:32:46.7167880Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:32:46.7173251Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:32:46.7218972Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:32:46.7253635Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:32:46.7587916Z Entering 'external/asmjit' 2025-05-07T20:32:46.7654679Z Entering 'external/composable_kernel' 2025-05-07T20:32:46.7728623Z Entering 'external/cpuinfo' 2025-05-07T20:32:46.7795217Z Entering 'external/cutlass' 2025-05-07T20:32:46.7871101Z Entering 'external/googletest' 2025-05-07T20:32:46.7937559Z Entering 'external/hipify_torch' 2025-05-07T20:32:46.8003263Z Entering 'external/json' 2025-05-07T20:32:46.8089101Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:32:46.8114389Z http.https://github.com/.extraheader 2025-05-07T20:32:46.8126062Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:32:46.8159333Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:32:46.8492407Z Entering 'external/asmjit' 2025-05-07T20:32:46.8536851Z http.https://github.com/.extraheader 2025-05-07T20:32:46.8579039Z Entering 'external/composable_kernel' 2025-05-07T20:32:46.8622839Z http.https://github.com/.extraheader 2025-05-07T20:32:46.8671720Z Entering 'external/cpuinfo' 2025-05-07T20:32:46.8714877Z http.https://github.com/.extraheader 2025-05-07T20:32:46.8757129Z Entering 'external/cutlass' 2025-05-07T20:32:46.8799135Z http.https://github.com/.extraheader 2025-05-07T20:32:46.8849807Z 
Entering 'external/googletest' 2025-05-07T20:32:46.8893612Z http.https://github.com/.extraheader 2025-05-07T20:32:46.8935836Z Entering 'external/hipify_torch' 2025-05-07T20:32:46.8978217Z http.https://github.com/.extraheader 2025-05-07T20:32:46.9020106Z Entering 'external/json' 2025-05-07T20:32:46.9062589Z http.https://github.com/.extraheader 2025-05-07T20:32:46.9243061Z A job completed hook has been configured by the self-hosted runner administrator 2025-05-07T20:32:46.9278177Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-05-07T20:32:46.9288591Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:32:46.9288948Z ##[endgroup] 2025-05-07T20:32:46.9391975Z [!ALERT!] Swap in detected! [!ALERT!] 2025-05-07T20:32:57.7270277Z [!ALERT!] Swap out detected [!ALERT!] 2025-05-07T20:33:14.1077698Z Cleaning up orphan processes
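For reproducing this outside the CI retry loop: the run above already used pytest --lf (re-run last failures only), and Hypothesis prints every failing parameter set, so a specific case can be pinned directly on the test. A sketch using hypothesis.example with the last failing example from this log:

    from hypothesis import example

    # Stacked next to the test's existing @given/@settings decorators;
    # explicit examples run before any generated ones:
    #
    # @example(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
    # def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled): ...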