2025-05-07T20:22:34.9267335Z Current runner version: '2.323.0'
2025-05-07T20:22:34.9276919Z Runner name: 'i-0e49e9d70b38203df'
2025-05-07T20:22:34.9278372Z Machine name: 'ip-10-0-29-135'
2025-05-07T20:22:34.9282608Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:22:34.9285637Z Contents: read
2025-05-07T20:22:34.9286403Z Metadata: read
2025-05-07T20:22:34.9287151Z Packages: read
2025-05-07T20:22:34.9288003Z ##[endgroup]
2025-05-07T20:22:34.9291070Z Secret source: None
2025-05-07T20:22:34.9292026Z Prepare workflow directory
2025-05-07T20:22:34.9841449Z Prepare all required actions
2025-05-07T20:22:34.9877411Z Getting action download info
2025-05-07T20:22:35.1777168Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:22:35.5006277Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:22:35.8679630Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:22:37.4667207Z Getting action download info
2025-05-07T20:22:37.5527012Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:22:37.7809898Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.11, 12.6.3, 12.6.3, clang)
2025-05-07T20:22:37.8421980Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:22:37.8555950Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:22:37.8568954Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:22:37.8570490Z ##[endgroup]
2025-05-07T20:22:39.0034409Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:22:39.0035091Z Instance Type: g5.4xlarge
2025-05-07T20:22:39.0035559Z AMI Name: unknown
2025-05-07T20:22:39.0071831Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:22:44.3782408Z ##[group]Run actions/checkout@v4
2025-05-07T20:22:44.3782720Z with:
2025-05-07T20:22:44.3782944Z   submodules: true
2025-05-07T20:22:44.3783190Z   repository: pytorch/FBGEMM
2025-05-07T20:22:44.3783589Z   token: ***
2025-05-07T20:22:44.3783790Z   ssh-strict: true
2025-05-07T20:22:44.3783999Z   ssh-user: git
2025-05-07T20:22:44.3784218Z   persist-credentials: true
2025-05-07T20:22:44.3784460Z   clean: true
2025-05-07T20:22:44.3784683Z   sparse-checkout-cone-mode: true
2025-05-07T20:22:44.3784946Z   fetch-depth: 1
2025-05-07T20:22:44.3785160Z   fetch-tags: false
2025-05-07T20:22:44.3785379Z   show-progress: true
2025-05-07T20:22:44.3785600Z   lfs: false
2025-05-07T20:22:44.3785802Z   set-safe-directory: true
2025-05-07T20:22:44.3786052Z env:
2025-05-07T20:22:44.3786262Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:44.3786566Z   BUILD_ENV: build_binary
2025-05-07T20:22:44.3786818Z   BUILD_TARGET: genai
2025-05-07T20:22:44.3787036Z   BUILD_VARIANT: cuda
2025-05-07T20:22:44.3787307Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:44.3787563Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:44.3787813Z ##[endgroup]
2025-05-07T20:22:44.4948015Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:22:44.4949207Z ##[group]Getting Git version info
2025-05-07T20:22:44.4949652Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.4950264Z [command]/usr/bin/git version
2025-05-07T20:22:44.4950538Z git version 2.47.1
2025-05-07T20:22:44.4962095Z ##[endgroup]
2025-05-07T20:22:44.4976567Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/854af8bc-58eb-438a-bc80-ef8ee6ced871' before making global git config changes
2025-05-07T20:22:44.4977461Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:22:44.4991043Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.5029142Z Deleting the contents of '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.5032072Z ##[group]Initializing the repository
2025-05-07T20:22:44.5036368Z [command]/usr/bin/git init /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.5079760Z hint: Using 'master' as the name for the initial branch. This default branch name
2025-05-07T20:22:44.5080989Z hint: is subject to change. To configure the initial branch name to use in all
2025-05-07T20:22:44.5082063Z hint: of your new repositories, which will suppress this warning, call:
2025-05-07T20:22:44.5082825Z hint:
2025-05-07T20:22:44.5083423Z hint:   git config --global init.defaultBranch <name>
2025-05-07T20:22:44.5084082Z hint:
2025-05-07T20:22:44.5084736Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
2025-05-07T20:22:44.5085477Z hint: 'development'. The just-created branch can be renamed via this command:
2025-05-07T20:22:44.5085891Z hint:
2025-05-07T20:22:44.5086110Z hint:   git branch -m <name>
2025-05-07T20:22:44.5086586Z Initialized empty Git repository in /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/
2025-05-07T20:22:44.5091130Z [command]/usr/bin/git remote add origin https://github.com/pytorch/FBGEMM
2025-05-07T20:22:44.5125536Z ##[endgroup]
2025-05-07T20:22:44.5126041Z ##[group]Disabling automatic garbage collection
2025-05-07T20:22:44.5129776Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:22:44.5161214Z ##[endgroup]
2025-05-07T20:22:44.5161603Z ##[group]Setting up auth
2025-05-07T20:22:44.5167823Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:22:44.5200191Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:22:44.5564806Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:22:44.5596099Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:22:44.5939748Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:44.5988067Z ##[endgroup]
2025-05-07T20:22:44.5988508Z ##[group]Fetching the repository
2025-05-07T20:22:44.5995930Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:22:45.3963908Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:22:45.3964828Z  * [new ref]  a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:22:45.3988085Z ##[endgroup]
2025-05-07T20:22:45.3988486Z ##[group]Determining the checkout info
2025-05-07T20:22:45.3991015Z ##[endgroup]
2025-05-07T20:22:45.4005931Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:22:45.4043719Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:22:45.4087578Z ##[group]Checking out the ref
2025-05-07T20:22:45.4091314Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:22:45.5185971Z Note: switching to 'refs/remotes/pull/4066/merge'.
2025-05-07T20:22:45.5186236Z
2025-05-07T20:22:45.5187273Z You are in 'detached HEAD' state. You can look around, make experimental
2025-05-07T20:22:45.5187816Z changes and commit them, and you can discard any commits you make in this
2025-05-07T20:22:45.5188330Z state without impacting any branches by switching back to a branch.
2025-05-07T20:22:45.5188639Z
2025-05-07T20:22:45.5188859Z If you want to create a new branch to retain commits you create, you may
2025-05-07T20:22:45.5189330Z do so (now or later) by using -c with the switch command. Example:
2025-05-07T20:22:45.5189617Z
2025-05-07T20:22:45.5189732Z   git switch -c <new-branch-name>
2025-05-07T20:22:45.5189925Z
2025-05-07T20:22:45.5190054Z Or undo this operation with:
2025-05-07T20:22:45.5190226Z
2025-05-07T20:22:45.5190321Z   git switch -
2025-05-07T20:22:45.5191089Z
2025-05-07T20:22:45.5191316Z Turn off this advice by setting config variable advice.detachedHead to false
2025-05-07T20:22:45.5191645Z
2025-05-07T20:22:45.5192028Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:22:45.5199149Z ##[endgroup]
2025-05-07T20:22:45.5199556Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:22:45.5204606Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:45.5252542Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:22:45.5285157Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:22:45.5317754Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:22:45.5345622Z ##[endgroup]
2025-05-07T20:22:45.5345998Z ##[group]Fetching submodules
2025-05-07T20:22:45.5348891Z [command]/usr/bin/git submodule sync
2025-05-07T20:22:45.5690586Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:22:45.6011832Z Submodule 'external/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'external/asmjit'
2025-05-07T20:22:45.6013814Z Submodule 'external/composable_kernel' (https://github.com/jwfromm/composable_kernel.git) registered for path 'external/composable_kernel'
2025-05-07T20:22:45.6018115Z Submodule 'external/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'external/cpuinfo'
2025-05-07T20:22:45.6022187Z Submodule 'external/cutlass' (https://github.com/jwfromm/cutlass) registered for path 'external/cutlass'
2025-05-07T20:22:45.6026457Z Submodule 'external/googletest' (https://github.com/google/googletest) registered for path 'external/googletest'
2025-05-07T20:22:45.6031068Z Submodule 'external/hipify_torch' (https://github.com/ROCmSoftwarePlatform/hipify_torch.git) registered for path 'external/hipify_torch'
2025-05-07T20:22:45.6035186Z Submodule 'external/json' (https://github.com/nlohmann/json.git) registered for path 'external/json'
2025-05-07T20:22:45.6065910Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/asmjit'...
2025-05-07T20:22:45.9861749Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/composable_kernel'...
2025-05-07T20:22:46.4722335Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cpuinfo'...
2025-05-07T20:22:46.9298870Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cutlass'...
2025-05-07T20:22:48.0445259Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/googletest'...
2025-05-07T20:22:48.3860781Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/hipify_torch'...
2025-05-07T20:22:48.6848699Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/json'...
2025-05-07T20:22:49.7675317Z From https://github.com/asmjit/asmjit
2025-05-07T20:22:49.7675786Z  * branch  e5d7c0bd5d9aec44d68830187138149e6a8c4e32 -> FETCH_HEAD
2025-05-07T20:22:49.8168458Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:22:50.4837349Z From https://github.com/jwfromm/composable_kernel
2025-05-07T20:22:50.4837908Z  * branch  4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 -> FETCH_HEAD
2025-05-07T20:22:50.7637159Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:22:51.3903930Z From https://github.com/pytorch/cpuinfo
2025-05-07T20:22:51.3904374Z  * branch  6543fec09b2f04ac4a666882998b534afc9c1349 -> FETCH_HEAD
2025-05-07T20:22:51.4901506Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:22:52.6267467Z From https://github.com/jwfromm/cutlass
2025-05-07T20:22:52.6267912Z  * branch  3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 -> FETCH_HEAD
2025-05-07T20:22:53.3201234Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:22:54.0569276Z From https://github.com/google/googletest
2025-05-07T20:22:54.0569720Z  * branch  f8d7d77c06936315286eb55f8de22cd23c188571 -> FETCH_HEAD
2025-05-07T20:22:54.0976855Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:22:54.7288302Z From https://github.com/ROCmSoftwarePlatform/hipify_torch
2025-05-07T20:22:54.7288810Z  * branch  420084499c7c1e1c2d801922f40df202eac5f3a0 -> FETCH_HEAD
2025-05-07T20:22:54.7402582Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:22:55.4854803Z From https://github.com/nlohmann/json
2025-05-07T20:22:55.4855247Z  * branch  9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 -> FETCH_HEAD
2025-05-07T20:22:55.5964914Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:22:55.5986140Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:22:55.6320636Z Entering 'external/asmjit'
2025-05-07T20:22:55.6351951Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.6382276Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.6414982Z Entering 'external/cutlass'
2025-05-07T20:22:55.6445442Z Entering 'external/googletest'
2025-05-07T20:22:55.6477800Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.6510808Z Entering 'external/json'
2025-05-07T20:22:55.6557050Z ##[endgroup]
2025-05-07T20:22:55.6557462Z ##[group]Persisting credentials for submodules
2025-05-07T20:22:55.6564046Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:22:55.6894153Z Entering 'external/asmjit'
2025-05-07T20:22:55.6958468Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.7038526Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.7107569Z Entering 'external/cutlass'
2025-05-07T20:22:55.7181271Z Entering 'external/googletest'
2025-05-07T20:22:55.7251852Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.7318097Z Entering 'external/json'
2025-05-07T20:22:55.7400531Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:22:55.7729542Z Entering 'external/asmjit'
2025-05-07T20:22:55.7794351Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:22:55.7797175Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.7858373Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:22:55.7861214Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.7922023Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:22:55.7925072Z Entering 'external/cutlass'
2025-05-07T20:22:55.7985682Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:22:55.7988475Z Entering 'external/googletest'
2025-05-07T20:22:55.8049557Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:22:55.8052573Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.8113935Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:22:55.8116558Z Entering 'external/json'
2025-05-07T20:22:55.8179438Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:22:55.8270209Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:22:55.8608492Z Entering 'external/asmjit'
2025-05-07T20:22:55.8640767Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.8672311Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.8703671Z Entering 'external/cutlass'
2025-05-07T20:22:55.8736903Z Entering 'external/googletest'
2025-05-07T20:22:55.8768140Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.8799505Z Entering 'external/json'
2025-05-07T20:22:55.8846925Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:22:55.9167511Z Entering 'external/asmjit'
2025-05-07T20:22:55.9201219Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.9233721Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.9264808Z Entering 'external/cutlass'
2025-05-07T20:22:55.9296579Z Entering 'external/googletest'
2025-05-07T20:22:55.9328319Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.9359703Z Entering 'external/json'
2025-05-07T20:22:55.9402661Z ##[endgroup]
2025-05-07T20:22:55.9444403Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:22:55.9471289Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
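The checkout step above reduces to a handful of git commands and can be approximated locally. A minimal sketch, assuming an unauthenticated clone (the runner's AUTHORIZATION extraheader and insteadOf rewrites are elided):

# Shallow-fetch the PR merge ref that the CI job tested, then init submodules.
git init FBGEMM && cd FBGEMM
git remote add origin https://github.com/pytorch/FBGEMM
git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
git checkout --force refs/remotes/pull/4066/merge   # detached HEAD, as in the log
git submodule sync
git -c protocol.version=2 submodule update --init --force --depth=1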
2025-05-07T20:22:55.9647198Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:22:55.9647609Z with:
2025-05-07T20:22:55.9647852Z   name: fbgemm_genai_x86_clang_py3.11_cu12.6.3.whl
2025-05-07T20:22:55.9648182Z   merge-multiple: false
2025-05-07T20:22:55.9648433Z   repository: pytorch/FBGEMM
2025-05-07T20:22:55.9648709Z   run-id: 14891846252
2025-05-07T20:22:55.9648947Z env:
2025-05-07T20:22:55.9649166Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:55.9649456Z   BUILD_ENV: build_binary
2025-05-07T20:22:55.9649695Z   BUILD_TARGET: genai
2025-05-07T20:22:55.9649917Z   BUILD_VARIANT: cuda
2025-05-07T20:22:55.9650152Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:55.9650393Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:55.9650638Z ##[endgroup]
2025-05-07T20:22:56.1957846Z Downloading single artifact
2025-05-07T20:22:56.3256734Z Preparing to download the following artifacts:
2025-05-07T20:22:56.3257733Z - fbgemm_genai_x86_clang_py3.11_cu12.6.3.whl (ID: 3081362348, Size: 12539580, Expected Digest: sha256:3102cc9a69eb3583ac0189afa1ca8413efb7589bfcb502323f89be9052562745)
2025-05-07T20:22:56.3770709Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-ed4592c2-86e8-5a3b-90c7-f7838a45299d/artifacts/e12be460a7cfa72843e2743f76c4cd1766662f0611b2be7066228e0cc294f44e.zip
2025-05-07T20:22:56.3772132Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:56.4451485Z (node:57018) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:22:56.4452500Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:22:56.6224279Z SHA256 digest of downloaded artifact is 3102cc9a69eb3583ac0189afa1ca8413efb7589bfcb502323f89be9052562745
2025-05-07T20:22:56.6225091Z Artifact download completed successfully.
2025-05-07T20:22:56.6225423Z Total of 1 artifact(s) downloaded
2025-05-07T20:22:56.6230156Z Download artifact has finished successfully
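The action compared the downloaded archive against the expected SHA-256 digest shown above; the same check can be reproduced by hand. A minimal sketch, assuming the archive has been saved locally as artifact.zip (a hypothetical filename):

# sha256sum --check exits non-zero if the computed digest does not match.
echo "3102cc9a69eb3583ac0189afa1ca8413efb7589bfcb502323f89be9052562745  artifact.zip" | sha256sum --check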
2025-05-07T20:22:56.6488144Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:22:56.6488536Z with:
2025-05-07T20:22:56.6488753Z   driver-version: 570.133.07
2025-05-07T20:22:56.6488997Z env:
2025-05-07T20:22:56.6489212Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.6489515Z   BUILD_ENV: build_binary
2025-05-07T20:22:56.6489765Z   BUILD_TARGET: genai
2025-05-07T20:22:56.6489992Z   BUILD_VARIANT: cuda
2025-05-07T20:22:56.6490230Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:56.6490487Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.6490726Z ##[endgroup]
2025-05-07T20:22:56.6580209Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:22:56.6580596Z with:
2025-05-07T20:22:56.6580984Z   timeout_minutes: 10
2025-05-07T20:22:56.6581216Z   max_attempts: 3
2025-05-07T20:22:56.6604261Z   command: # Is it disgusting to have a full shell script here in this github action? Sure
    # But is it the best way to make it so that this action relies on nothing else? Absolutely
    set -eou pipefail

    DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
    DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

    install_nvidia_docker2_amzn2() {
      (
        set -x
        # Needed for yum-config-manager
        sudo yum install -y yum-utils
        if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
          YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
        else
          # Amazon Linux 2
          YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
        fi
        sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
        sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      )
    }

    install_nvidia_docker2_ubuntu20() {
      (
        set -x
        # Install nvidia-driver package if not installed
        status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
        if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
          sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
          sudo systemctl restart docker
        fi
      )
    }

    pre_install_nvidia_driver_amzn2() {
      (
        # Purge any nvidia driver installed from RHEL repo
        sudo yum remove -y nvidia-driver-latest-dkms
      )
    }

    install_nvidia_driver_common() {
      (
        # Try to gather more information about the runner and its existing NVIDIA driver if any
        echo "Before installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true

        HAS_NVIDIA_DRIVER=0
        # Check if NVIDIA driver has already been installed
        if [ -x "$(command -v nvidia-smi)" ]; then
          set +e
          # The driver exists; check its version next. Also check only the first GPU if there is more than one,
          # so that the same driver version is not printed over multiple lines
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
          elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
            # Turn off persistent mode so that the installation script can unload the kernel module
            sudo killall nvidia-persistenced || true
          else
            HAS_NVIDIA_DRIVER=1
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
          fi
          set -e
        fi

        if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
          # CAUTION: this may need to be updated in future
          if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
            sudo yum groupinstall -y "Development Tools"
            # ensure our kernel install is the same as our underlying kernel,
            # groupinstall "Development Tools" has a habit of mismatching kernel headers
            sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
            sudo modprobe backlight
          fi
          sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

          set +e
          sudo /bin/bash /tmp/nvidia_driver -s --no-drm
          NVIDIA_INSTALLATION_STATUS=$?

          RESET_GPU=0
          if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
            sudo cat /var/log/nvidia-installer.log
            # Failed to install the NVIDIA driver; try to reset the GPU
            RESET_GPU=1
          elif [ -x "$(command -v nvidia-smi)" ]; then
            # Check again if nvidia-smi works even if the driver installation completes successfully
            INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
            NVIDIA_SMI_STATUS=$?
            if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
              RESET_GPU=1
            fi
          fi

          if [ "$RESET_GPU" -eq 1 ]; then
            NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
            # The GPU can get stuck in a failure state if somehow the test crashes the GPU microcode. When this
            # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
            for PCI_ID in $NVIDIA_DEVICES; do
              DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
              echo "Resetting $PCI_ID (enabled state: $DEVICE_ENABLED)"
              # This requires sudo permission of course
              echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
              sleep 1
            done
          fi

          sudo rm -fv /tmp/nvidia_driver
          set -e
        fi
      )
    }

    post_install_nvidia_driver_common() {
      (
        sudo modprobe nvidia || true
        echo "After installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true
        (
          set +e
          nvidia-smi
          # NB: Annoyingly, the nvidia-smi command returns successfully with return code 0 even in
          # the case where the driver has already crashed, as it still can get the driver version
          # and some basic information like the bus ID. However, the rest of the information
          # would be missing (ERR!), for example:
          #
          # +-----------------------------------------------------------------------------+
          # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
          # |-------------------------------+----------------------+----------------------+
          # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
          # |                               |                      |               MIG M. |
          # |===============================+======================+======================|
          # |   0  ERR!                Off  | 00000000:00:1E.0 Off |                 ERR! |
          # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |    ERR!      Default |
          # |                               |                      |                 ERR! |
          # +-------------------------------+----------------------+----------------------+
          #
          # +-----------------------------------------------------------------------------+
          # | Processes:                                                                  |
          # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
          # |        ID   ID                                                   Usage      |
          # |=============================================================================|
          # +-----------------------------------------------------------------------------+
          #
          # This should be reported as a failure instead, as it is guaranteed to fail when
          # Docker tries to run with --gpus all
          #
          # So, the correct check here is to query one of the missing pieces of info, like the
          # GPU name, so that the command can fail accordingly
          nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
          NVIDIA_SMI_STATUS=$?

          # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
          if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
            echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
          else
            echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
            exit ${NVIDIA_SMI_STATUS}
          fi
          set -e
        )
      )
    }

    install_nvidia_driver_amzn2() {
      (
        set -x
        pre_install_nvidia_driver_amzn2
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    install_nvidia_driver_ubuntu20() {
      (
        set -x
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    echo "== Installing nvidia driver ${DRIVER_FN} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_driver_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_driver_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    # Install container toolkit based on distribution
    echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_docker2_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_docker2_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

    # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
    # more than one GPU. This just needs to be run once. The command fails
    # on subsequent runs and complains that the mode is already on, but that's
    # ok
    sudo nvidia-persistenced || true

    # This should show persistence mode ON
    nvidia-smi
2025-05-07T20:22:56.6627732Z   retry_wait_seconds: 10
2025-05-07T20:22:56.6627993Z   polling_interval_seconds: 1
2025-05-07T20:22:56.6628255Z   warning_on_retry: true
2025-05-07T20:22:56.6628508Z   continue_on_error: false
2025-05-07T20:22:56.6628746Z env:
2025-05-07T20:22:56.6628972Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.6629272Z   BUILD_ENV: build_binary
2025-05-07T20:22:56.6629515Z   BUILD_TARGET: genai
2025-05-07T20:22:56.6629743Z   BUILD_VARIANT: cuda
2025-05-07T20:22:56.6629987Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:56.6630245Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.6630483Z   DRIVER_VERSION: 570.133.07
2025-05-07T20:22:56.6630732Z ##[endgroup]
2025-05-07T20:22:56.7428189Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:22:56.7428805Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:22:56.7432090Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:22:57.2929574Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:22:57.2943763Z No packages marked for removal.
2025-05-07T20:22:57.2993803Z Dependencies resolved.
2025-05-07T20:22:57.3003471Z Nothing to do.
2025-05-07T20:22:57.3003750Z Complete!
2025-05-07T20:22:57.3330775Z + install_nvidia_driver_common
2025-05-07T20:22:57.3334427Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:22:57.3334738Z + lspci
2025-05-07T20:22:57.3336117Z Before installing NVIDIA driver
2025-05-07T20:22:57.3522977Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:57.3523770Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:57.3524317Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:57.3524819Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:57.3525288Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:57.3525804Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:57.3526280Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:57.3526740Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:57.3527133Z + lsmod
2025-05-07T20:22:57.3569382Z Module                  Size  Used by
2025-05-07T20:22:57.3569966Z xt_conntrack           16384  1
2025-05-07T20:22:57.3570477Z nft_chain_nat          16384  3
2025-05-07T20:22:57.3570998Z xt_MASQUERADE          20480  1
2025-05-07T20:22:57.3571601Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:57.3572248Z nf_conntrack_netlink   57344  0
2025-05-07T20:22:57.3573025Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:57.3573879Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:22:57.3574488Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:22:57.3575061Z xfrm_user              57344  1
2025-05-07T20:22:57.3575583Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:22:57.3576144Z xt_addrtype            16384  2
2025-05-07T20:22:57.3576655Z nft_compat             20480  4
2025-05-07T20:22:57.3577262Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:22:57.3578086Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:57.3578820Z br_netfilter           36864  0
2025-05-07T20:22:57.3579125Z bridge                323584  1 br_netfilter
2025-05-07T20:22:57.3579431Z stp                    16384  1 bridge
2025-05-07T20:22:57.3579708Z llc                    16384  2 bridge,stp
2025-05-07T20:22:57.3579993Z overlay               167936  0
2025-05-07T20:22:57.3580248Z tls                   135168  0
2025-05-07T20:22:57.3580499Z nls_ascii              16384  1
2025-05-07T20:22:57.3580749Z nls_cp437              20480  1
2025-05-07T20:22:57.3581001Z vfat                   24576  1
2025-05-07T20:22:57.3581254Z fat                    86016  1 vfat
2025-05-07T20:22:57.3581517Z ena                   180224  0
2025-05-07T20:22:57.3581759Z i8042                  45056  0
2025-05-07T20:22:57.3582015Z serio                  28672  3 i8042
2025-05-07T20:22:57.3582287Z ghash_clmulni_intel    16384  0
2025-05-07T20:22:57.3582548Z button                 24576  0
2025-05-07T20:22:57.3582802Z sunrpc                696320  1
2025-05-07T20:22:57.3583053Z sch_fq_codel           20480  17
2025-05-07T20:22:57.3583316Z dm_mod                188416  0
2025-05-07T20:22:57.3583570Z fuse                  163840  1
2025-05-07T20:22:57.3583822Z configfs               57344  1
2025-05-07T20:22:57.3584076Z loop                   36864  0
2025-05-07T20:22:57.3584333Z dax                    45056  1 dm_mod
2025-05-07T20:22:57.3584604Z dmi_sysfs              20480  0
2025-05-07T20:22:57.3584863Z crc32_pclmul           16384  0
2025-05-07T20:22:57.3585121Z crc32c_intel           24576  0
2025-05-07T20:22:57.3585380Z efivarfs               24576  1
2025-05-07T20:22:57.3585627Z + modinfo nvidia
2025-05-07T20:22:57.3591088Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:57.3591562Z import_ns:      DMA_BUF
2025-05-07T20:22:57.3591813Z alias:          char-major-195-*
2025-05-07T20:22:57.3592084Z version:        570.133.07
2025-05-07T20:22:57.3592341Z supported:      external
2025-05-07T20:22:57.3592674Z license:        Dual MIT/GPL
2025-05-07T20:22:57.3592976Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:57.3593320Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:57.3593931Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:22:57.3594266Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:57.3594608Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:57.3594949Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:57.3595265Z depends:        i2c-core,drm
2025-05-07T20:22:57.3595521Z retpoline:      Y
2025-05-07T20:22:57.3595749Z name:           nvidia
2025-05-07T20:22:57.3596118Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:57.3596624Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:57.3597065Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:57.3597582Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:22:57.3597900Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:22:57.3598312Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:57.3598766Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:22:57.3599141Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:22:57.3599443Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:22:57.3599809Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:57.3600202Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:22:57.3600544Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:22:57.3600852Z parm:           NVreg_EnableMSI:int
2025-05-07T20:22:57.3601176Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:57.3601686Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:57.3602095Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:57.3602475Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:57.3602895Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.3603297Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:22:57.3603722Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.3604132Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:22:57.3604473Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:57.3604840Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:57.3605219Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:57.3605561Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:22:57.3606079Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:57.3606417Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:57.3606745Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:57.3607053Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:22:57.3607406Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:57.3607839Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:22:57.3608179Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:22:57.3608515Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:57.3608872Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:57.3609217Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:22:57.3609556Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:57.3609901Z parm:           NVreg_RmMsg:charp
2025-05-07T20:22:57.3610198Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:22:57.3610521Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:22:57.3610860Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:22:57.3611185Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:57.3611517Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:57.3611882Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:57.3612242Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:22:57.3612580Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:22:57.3612924Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:57.3613272Z parm:           rm_firmware_active:charp
2025-05-07T20:22:57.3613740Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:22:57.3613994Z ++ command -v nvidia-smi
2025-05-07T20:22:57.3614252Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:22:57.3614510Z + set +e
2025-05-07T20:22:57.3614815Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:22:59.1803623Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:22:59.1804003Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:22:59.1804245Z + '[' 0 -ne 0 ']'
2025-05-07T20:22:59.1804469Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:22:59.1804745Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:22:59.1805173Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:22:59.1805875Z + set -e
2025-05-07T20:22:59.1806453Z + '[' 1 -eq 0 ']'
2025-05-07T20:22:59.1806841Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:22:59.1807293Z + post_install_nvidia_driver_common
2025-05-07T20:22:59.1810932Z + sudo modprobe nvidia
2025-05-07T20:22:59.2711710Z + echo 'After installing NVIDIA driver'
2025-05-07T20:22:59.2712043Z + lspci
2025-05-07T20:22:59.2712268Z After installing NVIDIA driver
2025-05-07T20:22:59.2827332Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:59.2827821Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:59.2828359Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:59.2828876Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:59.2829351Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:59.2829868Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:59.2830379Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:59.2830856Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:59.2831265Z + lsmod
2025-05-07T20:22:59.2858878Z Module                  Size  Used by
2025-05-07T20:22:59.2859243Z nvidia_uvm           1884160  0
2025-05-07T20:22:59.2859619Z nvidia              11583488  1 nvidia_uvm
2025-05-07T20:22:59.2860015Z drm                   602112  1 nvidia
2025-05-07T20:22:59.2860384Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:22:59.2860695Z backlight              24576  1 drm
2025-05-07T20:22:59.2860983Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:22:59.2861297Z xt_conntrack           16384  1
2025-05-07T20:22:59.2861669Z nft_chain_nat          16384  3
2025-05-07T20:22:59.2862022Z xt_MASQUERADE          20480  1
2025-05-07T20:22:59.2862423Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:59.2862778Z nf_conntrack_netlink   57344  0
2025-05-07T20:22:59.2863177Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:59.2863606Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:22:59.2863926Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:22:59.2864220Z xfrm_user              57344  1
2025-05-07T20:22:59.2864483Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:22:59.2864761Z xt_addrtype            16384  2
2025-05-07T20:22:59.2865019Z nft_compat             20480  4
2025-05-07T20:22:59.2865328Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:22:59.2865737Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:59.2866100Z br_netfilter           36864  0
2025-05-07T20:22:59.2866369Z bridge                323584  1 br_netfilter
2025-05-07T20:22:59.2866678Z stp                    16384  1 bridge
2025-05-07T20:22:59.2866951Z llc                    16384  2 bridge,stp
2025-05-07T20:22:59.2867237Z overlay               167936  0
2025-05-07T20:22:59.2867485Z tls                   135168  0
2025-05-07T20:22:59.2867728Z nls_ascii              16384  1
2025-05-07T20:22:59.2868236Z nls_cp437              20480  1
2025-05-07T20:22:59.2868485Z vfat                   24576  1
2025-05-07T20:22:59.2868733Z fat                    86016  1 vfat
2025-05-07T20:22:59.2868988Z ena                   180224  0
2025-05-07T20:22:59.2869231Z i8042                  45056  0
2025-05-07T20:22:59.2869477Z serio                  28672  3 i8042
2025-05-07T20:22:59.2869746Z ghash_clmulni_intel    16384  0
2025-05-07T20:22:59.2870003Z button                 24576  0
2025-05-07T20:22:59.2870253Z sunrpc                696320  1
2025-05-07T20:22:59.2870500Z sch_fq_codel           20480  17
2025-05-07T20:22:59.2870757Z dm_mod                188416  0
2025-05-07T20:22:59.2870999Z fuse                  163840  1
2025-05-07T20:22:59.2871237Z configfs               57344  1
2025-05-07T20:22:59.2871623Z loop                   36864  0
2025-05-07T20:22:59.2871872Z dax                    45056  1 dm_mod
2025-05-07T20:22:59.2872144Z dmi_sysfs              20480  0
2025-05-07T20:22:59.2872388Z crc32_pclmul           16384  0
2025-05-07T20:22:59.2872644Z crc32c_intel           24576  0
2025-05-07T20:22:59.2872891Z efivarfs               24576  1
2025-05-07T20:22:59.2873130Z + modinfo nvidia
2025-05-07T20:22:59.2876062Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:59.2876721Z import_ns:      DMA_BUF
2025-05-07T20:22:59.2877055Z alias:          char-major-195-*
2025-05-07T20:22:59.2877392Z version:        570.133.07
2025-05-07T20:22:59.2877641Z supported:      external
2025-05-07T20:22:59.2877886Z license:        Dual MIT/GPL
2025-05-07T20:22:59.2878159Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:59.2878494Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:59.2878806Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:22:59.2879119Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:59.2879455Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:59.2879783Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:59.2880090Z depends:        i2c-core,drm
2025-05-07T20:22:59.2880344Z retpoline:      Y
2025-05-07T20:22:59.2880557Z name:           nvidia
2025-05-07T20:22:59.2880953Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:59.2881600Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:59.2882199Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:59.2882678Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:22:59.2882981Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:22:59.2883280Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:59.2883595Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:22:59.2883893Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:22:59.2884208Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:22:59.2884572Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:59.2884973Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:22:59.2885305Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:22:59.2885607Z parm:           NVreg_EnableMSI:int
2025-05-07T20:22:59.2885913Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:59.2886270Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:59.2886668Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:59.2887050Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:59.2887541Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:59.2887960Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:22:59.2888382Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:59.2888811Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:22:59.2889145Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:59.2889511Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:59.2890004Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:59.2890345Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:22:59.2890666Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:59.2890997Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:59.2891312Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:59.2891623Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:22:59.2891967Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:59.2892326Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:22:59.2892646Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:22:59.2892976Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:59.2893316Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:59.2893743Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:22:59.2894081Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:59.2894408Z parm:           NVreg_RmMsg:charp
2025-05-07T20:22:59.2894694Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:22:59.2895026Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:22:59.2895350Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:22:59.2895655Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:59.2895985Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:59.2896337Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:59.2896732Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:22:59.2897061Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:22:59.2897398Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:59.2897734Z parm:           rm_firmware_active:charp
2025-05-07T20:22:59.2898019Z + set +e
2025-05-07T20:22:59.2898219Z + nvidia-smi
2025-05-07T20:23:00.6992256Z Wed May  7 20:23:00 2025
2025-05-07T20:23:00.6992781Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:00.6993408Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:00.6993898Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:00.6994394Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:00.6994917Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:00.6995335Z |                                         |                        |               MIG M. |
2025-05-07T20:23:00.6995665Z |=========================================+========================+======================|
2025-05-07T20:23:00.7056239Z |   0  NVIDIA A10G                   Off  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:00.7056872Z |  0%   29C    P0             61W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:00.7057406Z |                                         |                        |                  N/A |
2025-05-07T20:23:00.7057953Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:00.7058479Z
2025-05-07T20:23:00.7058870Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:00.7059292Z | Processes:                                                                              |
2025-05-07T20:23:00.7059727Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:00.7060133Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:00.7060473Z |=========================================================================================|
2025-05-07T20:23:00.7061361Z |  No running processes found                                                             |
2025-05-07T20:23:00.7062452Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.1168013Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:02.5204024Z NVIDIA A10G
2025-05-07T20:23:02.8032403Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:02.8032664Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:02.8032907Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:02.8033190Z + set -e
2025-05-07T20:23:02.8033390Z INFO: Ignoring allowed status 0
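The driver check that just ran treats nvidia-smi exit statuses 0 and 14 as healthy and anything else as a failure. The same logic, factored out of the script above into a standalone helper (the function name is mine):

# Query a field that goes missing (ERR!) when the driver has crashed, so a
# wedged GPU fails the check instead of sneaking through with status 0.
check_gpu_health() {
  nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
  local status=$?
  # Statuses 0 and 14 are allowed, per https://github.com/NVIDIA/gpu-operator/issues/285
  if [ "$status" -eq 0 ] || [ "$status" -eq 14 ]; then
    echo "INFO: Ignoring allowed status ${status}"
  else
    echo "ERROR: nvidia-smi exited with unresolved status ${status}"
    return "$status"
  fi
}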
2025-05-07T20:23:02.8041436Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:02.8045411Z + sudo yum install -y yum-utils
2025-05-07T20:23:03.2040635Z Last metadata expiration check: 0:05:44 ago on Wed May  7 20:17:19 2025.
2025-05-07T20:23:03.2288389Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:03.2680078Z Dependencies resolved.
2025-05-07T20:23:03.2859750Z Nothing to do.
2025-05-07T20:23:03.2860441Z Complete!
2025-05-07T20:23:03.3253570Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:03.3254392Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.3255315Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.6104974Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.6668170Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:04.2313304Z nvidia-container-toolkit                         14 kB/s | 833  B     00:00
2025-05-07T20:23:04.2559093Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:04.2955567Z Dependencies resolved.
2025-05-07T20:23:04.3137286Z ================================================================================
2025-05-07T20:23:04.3137711Z  Package                        Arch    Version   Repository                Size
2025-05-07T20:23:04.3138193Z ================================================================================
2025-05-07T20:23:04.3138585Z Downgrading:
2025-05-07T20:23:04.3138961Z  nvidia-container-toolkit       x86_64  1.16.2-1  nvidia-container-toolkit 1.2 M
2025-05-07T20:23:04.3139554Z  nvidia-container-toolkit-base  x86_64  1.16.2-1  nvidia-container-toolkit 5.6 M
2025-05-07T20:23:04.3139907Z
2025-05-07T20:23:04.3140008Z Transaction Summary
2025-05-07T20:23:04.3140269Z ================================================================================
2025-05-07T20:23:04.3140584Z Downgrade  2 Packages
2025-05-07T20:23:04.3140741Z
2025-05-07T20:23:04.3140915Z Total download size: 6.8 M
2025-05-07T20:23:04.3142339Z Downloading Packages:
2025-05-07T20:23:04.3571701Z (1/2): nvidia-container-toolkit-1.16.2-1.x86_64  30 MB/s | 1.2 MB     00:00
2025-05-07T20:23:04.4299744Z (2/2): nvidia-container-toolkit-base-1.16.2-1.x  49 MB/s | 5.6 MB     00:00
2025-05-07T20:23:04.4308534Z --------------------------------------------------------------------------------
2025-05-07T20:23:04.4311463Z Total                                            59 MB/s | 6.8 MB     00:00
2025-05-07T20:23:04.4314000Z Running transaction check
2025-05-07T20:23:04.4417048Z Transaction check succeeded.
2025-05-07T20:23:04.4417351Z Running transaction test
2025-05-07T20:23:04.4711084Z Transaction test succeeded.
2025-05-07T20:23:04.4713648Z Running transaction
2025-05-07T20:23:05.0235639Z   Preparing        :                                                        1/1
2025-05-07T20:23:05.1329560Z   Downgrading      : nvidia-container-toolkit-base-1.16.2-1.x86_64          1/4
2025-05-07T20:23:05.1358556Z   Downgrading      : nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:05.1576006Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:05.1576788Z   Cleanup          : nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:05.1684356Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:05.1709084Z   Cleanup          : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
2025-05-07T20:23:06.5438072Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               4/4
2025-05-07T20:23:06.5438865Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64               1/4
2025-05-07T20:23:06.5439449Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64               2/4
2025-05-07T20:23:06.5439974Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64          3/4
2025-05-07T20:23:06.6727410Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
================================================================================
2025-05-07T20:23:06.6728558Z WARNING:
2025-05-07T20:23:06.6728836Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:06.6729063Z
2025-05-07T20:23:06.6729156Z   Available Versions:
2025-05-07T20:23:06.6729308Z
2025-05-07T20:23:06.6729411Z   Version 2023.7.20250331:
2025-05-07T20:23:06.6729728Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:06.6729981Z
2025-05-07T20:23:06.6730108Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:06.6730315Z
2025-05-07T20:23:06.6730402Z     Release notes:
2025-05-07T20:23:06.6730815Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:06.6731181Z
2025-05-07T20:23:06.6731279Z   Version 2023.7.20250414:
2025-05-07T20:23:06.6731588Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:06.6731843Z
2025-05-07T20:23:06.6731957Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:06.6732171Z
2025-05-07T20:23:06.6732257Z     Release notes:
2025-05-07T20:23:06.6732797Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:06.6733186Z
2025-05-07T20:23:06.6733307Z   Version 2023.7.20250428:
2025-05-07T20:23:06.6733918Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:06.6734201Z
2025-05-07T20:23:06.6734379Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:06.6734614Z
2025-05-07T20:23:06.6734729Z     Release notes:
2025-05-07T20:23:06.6735280Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:06.6748037Z
2025-05-07T20:23:06.6748173Z ================================================================================
2025-05-07T20:23:06.7087686Z
2025-05-07T20:23:06.7087954Z
2025-05-07T20:23:06.7088099Z Downgraded:
2025-05-07T20:23:06.7088592Z   nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:06.7089242Z   nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:06.7089596Z
2025-05-07T20:23:06.7089684Z Complete!
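The transaction above deliberately downgraded the toolkit from 1.17.6 to the pinned 1.16.2. One way to keep a later dnf upgrade from undoing the pin is the versionlock plugin; a sketch, assuming the plugin package is available on the AMI (it is not part of this job's script):

sudo dnf install -y 'dnf-command(versionlock)'   # pulls in the versionlock plugin
sudo dnf versionlock add nvidia-container-toolkit-1.16.2-1 nvidia-container-toolkit-base-1.16.2-1
sudo dnf versionlock list                        # confirm both locks took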
2025-05-07T20:23:06.7535303Z + sudo systemctl restart docker
2025-05-07T20:23:10.6649006Z Wed May  7 20:23:10 2025
2025-05-07T20:23:10.6649607Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:10.6650237Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:10.6650725Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:10.6651218Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:10.6651749Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:10.6652170Z |                                         |                        |               MIG M. |
2025-05-07T20:23:10.6652504Z |=========================================+========================+======================|
2025-05-07T20:23:10.6732043Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:10.6733074Z |  0%   29C    P0             61W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:10.6733592Z |                                         |                        |                  N/A |
2025-05-07T20:23:10.6734138Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:10.6734653Z
2025-05-07T20:23:10.6735152Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:10.6735593Z | Processes:                                                                              |
2025-05-07T20:23:10.6736037Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:10.6736678Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:10.6737154Z |=========================================================================================|
2025-05-07T20:23:10.6737728Z |  No running processes found                                                             |
2025-05-07T20:23:10.6738269Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:11.7199626Z Command completed after 1 attempt(s).
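The step's lasting effect is the GPU_FLAG value appended to GITHUB_ENV; a later step would typically splice it into a docker run line. A sketch of that usage, with a hypothetical CUDA image tag:

# GPU_FLAG expands to: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
# Intentionally left unquoted so the two flags split into separate arguments.
docker run --rm ${GPU_FLAG} nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi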
2025-05-07T20:23:11.7286107Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:11.7286585Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:11.7301930Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:11.7302278Z env:
2025-05-07T20:23:11.7302497Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:11.7302817Z   BUILD_ENV: build_binary
2025-05-07T20:23:11.7303071Z   BUILD_TARGET: genai
2025-05-07T20:23:11.7303313Z   BUILD_VARIANT: cuda
2025-05-07T20:23:11.7303557Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:11.7303825Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:11.7304131Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:11.7304466Z ##[endgroup]
2025-05-07T20:23:12.0650980Z ################################################################################
2025-05-07T20:23:12.0651360Z # Print System Info
2025-05-07T20:23:12.0651582Z #
2025-05-07T20:23:12.0665955Z # [2025-05-07T20:23:12.066Z] + print_system_info
2025-05-07T20:23:12.0666327Z ################################################################################
2025-05-07T20:23:12.0666542Z
2025-05-07T20:23:12.0666655Z ################################################################################
2025-05-07T20:23:12.0666990Z [INFO] Printing environment variables ...
2025-05-07T20:23:12.0667288Z + printenv
2025-05-07T20:23:12.0667402Z
2025-05-07T20:23:12.0688020Z SHELL=/bin/bash
2025-05-07T20:23:12.0688391Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:12.0688799Z BUILD_VARIANT=cuda
2025-05-07T20:23:12.0689325Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_2a59ce91-961f-4faf-8430-0285c6cfa352
2025-05-07T20:23:12.0689887Z GITHUB_ACTION=__run
2025-05-07T20:23:12.0690177Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:12.0690549Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:12.0690806Z RUNNER_NAME=i-0e49e9d70b38203df
2025-05-07T20:23:12.0691082Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:12.0691386Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:12.0691654Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:12.0692016Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:12.0692440Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:12.0692722Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:12.0693019Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:12.0693588Z ***
2025-05-07T20:23:12.0693793Z LOGNAME=ec2-user
2025-05-07T20:23:12.0694029Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:12.0694283Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:12.0694514Z GITHUB_ACTIONS=true
2025-05-07T20:23:12.0694739Z SYSTEMD_EXEC_PID=55576
2025-05-07T20:23:12.0695009Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:12.0695557Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:12.0696068Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:12.0696350Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:12.0696606Z RUNNER_OS=Linux
2025-05-07T20:23:12.0696836Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:12.0697084Z HOME=/home/ec2-user
2025-05-07T20:23:12.0697324Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:12.0697617Z LANG=C.UTF-8
2025-05-07T20:23:12.0697907Z RUNNER_TRACKING_ID=github_d031b214-61cb-47a7-8b20-1c60992044bb
2025-05-07T20:23:12.0698253Z RUNNER_ARCH=X64
2025-05-07T20:23:12.0698528Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:12.0699210Z BUILD_TARGET=genai
2025-05-07T20:23:12.0699732Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_2a59ce91-961f-4faf-8430-0285c6cfa352
2025-05-07T20:23:12.0700588Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_2a59ce91-961f-4faf-8430-0285c6cfa352
2025-05-07T20:23:12.0701314Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:12.0701965Z INVOCATION_ID=452351df4c4043e9bfbc1dc973e48402
2025-05-07T20:23:12.0702288Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:12.0702553Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:12.0703125Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_2a59ce91-961f-4faf-8430-0285c6cfa352
2025-05-07T20:23:12.0703724Z BUILD_ENV=build_binary
2025-05-07T20:23:12.0703955Z GITHUB_ACTOR=q10
2025-05-07T20:23:12.0704170Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:12.0704384Z KERN_NAME_LC=linux
2025-05-07T20:23:12.0704617Z BUILD_CUDA_VERSION=12.6.3
2025-05-07T20:23:12.0704912Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:12.0705260Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:12.0705498Z USER=ec2-user
2025-05-07T20:23:12.0706029Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:12.0706324Z SHLVL=1
2025-05-07T20:23:12.0706510Z GITHUB_ACTOR_ID=255046
2025-05-07T20:23:12.0706819Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool
2025-05-07T20:23:12.0707258Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e
2025-05-07T20:23:12.0707622Z GITHUB_REF_NAME=4066/merge
2025-05-07T20:23:12.0707855Z KERN_NAME=Linux
2025-05-07T20:23:12.0708083Z GITHUB_JOB=test_and_publish_artifact
2025-05-07T20:23:12.0708487Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh
2025-05-07T20:23:12.0708904Z GITHUB_REPOSITORY=pytorch/FBGEMM
2025-05-07T20:23:12.0709179Z GITHUB_RETENTION_DAYS=90
2025-05-07T20:23:12.0709416Z JOURNAL_STREAM=8:83634
2025-05-07T20:23:12.0709727Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM
2025-05-07T20:23:12.0710087Z GITHUB_ACTION_REPOSITORY=
2025-05-07T20:23:12.0710391Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
2025-05-07T20:23:12.0710709Z GITHUB_BASE_REF=main
2025-05-07T20:23:12.0710927Z CI=true
2025-05-07T20:23:12.0711137Z GITHUB_REPOSITORY_OWNER=pytorch
2025-05-07T20:23:12.0711411Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6
2025-05-07T20:23:12.0711676Z GITHUB_ACTION_REF=
2025-05-07T20:23:12.0711925Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI
2025-05-07T20:23:12.0712528Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_2a59ce91-961f-4faf-8430-0285c6cfa352
2025-05-07T20:23:12.0713101Z MACHINE_NAME=x86_64
2025-05-07T20:23:12.0713321Z _=/usr/bin/printenv
2025-05-07T20:23:12.0713452Z
2025-05-07T20:23:12.0713571Z ################################################################################
2025-05-07T20:23:12.0713880Z [INFO] Print ldd version ...
2025-05-07T20:23:12.0714147Z + ldd --version
2025-05-07T20:23:12.0714278Z
2025-05-07T20:23:12.0714362Z ldd (GNU libc) 2.34
2025-05-07T20:23:12.0714627Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:23:12.0715059Z This is free software; see the source for copying conditions.  There is NO
2025-05-07T20:23:12.0715586Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:23:12.0716034Z Written by Roland McGrath and Ulrich Drepper.
2025-05-07T20:23:12.0716252Z
2025-05-07T20:23:12.0716368Z ################################################################################
2025-05-07T20:23:12.0716674Z [INFO] Print CPU info ...
2025-05-07T20:23:12.0716906Z + nproc 2025-05-07T20:23:12.0717014Z 2025-05-07T20:23:12.0732447Z 16 2025-05-07T20:23:12.0734185Z 2025-05-07T20:23:12.0734427Z + lscpu 2025-05-07T20:23:12.0734543Z 2025-05-07T20:23:12.0863268Z Architecture: x86_64 2025-05-07T20:23:12.0863735Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:12.0864396Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.0864782Z Byte Order: Little Endian 2025-05-07T20:23:12.0865100Z CPU(s): 16 2025-05-07T20:23:12.0865398Z On-line CPU(s) list: 0-15 2025-05-07T20:23:12.0865716Z Vendor ID: AuthenticAMD 2025-05-07T20:23:12.0866064Z Model name: AMD EPYC 7R32 2025-05-07T20:23:12.0866386Z CPU family: 23 2025-05-07T20:23:12.0866810Z Model: 49 2025-05-07T20:23:12.0867094Z Thread(s) per core: 2 2025-05-07T20:23:12.0867387Z Core(s) per socket: 8 2025-05-07T20:23:12.0867676Z Socket(s): 1 2025-05-07T20:23:12.0867953Z Stepping: 0 2025-05-07T20:23:12.0868262Z BogoMIPS: 5600.00 2025-05-07T20:23:12.0870345Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.0872419Z Hypervisor vendor: KVM 2025-05-07T20:23:12.0872744Z Virtualization type: full 2025-05-07T20:23:12.0873087Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:12.0873471Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:12.0873834Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:12.0874200Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:12.0874527Z NUMA node(s): 1 2025-05-07T20:23:12.0874822Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:12.0875169Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:12.0875533Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:12.0875897Z Vulnerability L1tf: Not affected 2025-05-07T20:23:12.0876245Z Vulnerability Mds: Not affected 2025-05-07T20:23:12.0876618Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:12.0876974Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:12.0877424Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:12.0878084Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:12.0878869Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:12.0879618Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:12.0880537Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:12.0881444Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:12.0882238Z Vulnerability Srbds: Not affected 2025-05-07T20:23:12.0882603Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:12.0882920Z 2025-05-07T20:23:12.0883023Z + cat /proc/cpuinfo 2025-05-07T20:23:12.0883162Z 2025-05-07T20:23:12.0883266Z processor : 0 2025-05-07T20:23:12.0883490Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.0883740Z cpu family : 23 2025-05-07T20:23:12.0883959Z model : 49 
2025-05-07T20:23:12.0884173Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.0884432Z stepping : 0 2025-05-07T20:23:12.0884657Z microcode : 0x830107f 2025-05-07T20:23:12.0884998Z cpu MHz : 2442.812 2025-05-07T20:23:12.0885235Z cache size : 512 KB 2025-05-07T20:23:12.0885460Z physical id : 0 2025-05-07T20:23:12.0885675Z siblings : 16 2025-05-07T20:23:12.0885890Z core id : 0 2025-05-07T20:23:12.0886102Z cpu cores : 8 2025-05-07T20:23:12.0886307Z apicid : 0 2025-05-07T20:23:12.0886517Z initial apicid : 0 2025-05-07T20:23:12.0886739Z fpu : yes 2025-05-07T20:23:12.0886943Z fpu_exception : yes 2025-05-07T20:23:12.0887175Z cpuid level : 13 2025-05-07T20:23:12.0887394Z wp : yes 2025-05-07T20:23:12.0889567Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.0891797Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.0892293Z bogomips : 5600.00 2025-05-07T20:23:12.0892514Z TLB size : 3072 4K pages 2025-05-07T20:23:12.0892759Z clflush size : 64 2025-05-07T20:23:12.0892970Z cache_alignment : 64 2025-05-07T20:23:12.0893239Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.0893562Z power management: 2025-05-07T20:23:12.0893691Z 2025-05-07T20:23:12.0893773Z processor : 1 2025-05-07T20:23:12.0893993Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.0894227Z cpu family : 23 2025-05-07T20:23:12.0894430Z model : 49 2025-05-07T20:23:12.0894634Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.0894885Z stepping : 0 2025-05-07T20:23:12.0895084Z microcode : 0x830107f 2025-05-07T20:23:12.0895319Z cpu MHz : 3245.085 2025-05-07T20:23:12.0895528Z cache size : 512 KB 2025-05-07T20:23:12.0895743Z physical id : 0 2025-05-07T20:23:12.0895953Z siblings : 16 2025-05-07T20:23:12.0896160Z core id : 1 2025-05-07T20:23:12.0896349Z cpu cores : 8 2025-05-07T20:23:12.0896551Z apicid : 2 2025-05-07T20:23:12.0896744Z initial apicid : 2 2025-05-07T20:23:12.0896956Z fpu : yes 2025-05-07T20:23:12.0897158Z fpu_exception : yes 2025-05-07T20:23:12.0897380Z cpuid level : 13 2025-05-07T20:23:12.0897580Z wp : yes 2025-05-07T20:23:12.0899533Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.0901750Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.0902246Z bogomips : 5600.00 2025-05-07T20:23:12.0902465Z TLB size : 3072 4K pages 2025-05-07T20:23:12.0902697Z clflush size : 64 
2025-05-07T20:23:12.0902920Z cache_alignment : 64 2025-05-07T20:23:12.0903193Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.0903503Z power management: 2025-05-07T20:23:12.0903641Z 2025-05-07T20:23:12.0903727Z processor : 2 2025-05-07T20:23:12.0903948Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.0904178Z cpu family : 23 2025-05-07T20:23:12.0904401Z model : 49 2025-05-07T20:23:12.0904608Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.0904855Z stepping : 0 2025-05-07T20:23:12.0905067Z microcode : 0x830107f 2025-05-07T20:23:12.0905298Z cpu MHz : 3305.794 2025-05-07T20:23:12.0905504Z cache size : 512 KB 2025-05-07T20:23:12.0905997Z physical id : 0 2025-05-07T20:23:12.0906208Z siblings : 16 2025-05-07T20:23:12.0906566Z core id : 2 2025-05-07T20:23:12.0906763Z cpu cores : 8 2025-05-07T20:23:12.0906970Z apicid : 4 2025-05-07T20:23:12.0907174Z initial apicid : 4 2025-05-07T20:23:12.0907385Z fpu : yes 2025-05-07T20:23:12.0907590Z fpu_exception : yes 2025-05-07T20:23:12.0907814Z cpuid level : 13 2025-05-07T20:23:12.0908017Z wp : yes 2025-05-07T20:23:12.0910103Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.0912334Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.0912836Z bogomips : 5600.00 2025-05-07T20:23:12.0913053Z TLB size : 3072 4K pages 2025-05-07T20:23:12.0913294Z clflush size : 64 2025-05-07T20:23:12.0913517Z cache_alignment : 64 2025-05-07T20:23:12.0913783Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.0914104Z power management: 2025-05-07T20:23:12.0914247Z 2025-05-07T20:23:12.0914335Z processor : 3 2025-05-07T20:23:12.0914558Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.0914806Z cpu family : 23 2025-05-07T20:23:12.0915010Z model : 49 2025-05-07T20:23:12.0915221Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.0915467Z stepping : 0 2025-05-07T20:23:12.0915674Z microcode : 0x830107f 2025-05-07T20:23:12.0915913Z cpu MHz : 3299.548 2025-05-07T20:23:12.0916132Z cache size : 512 KB 2025-05-07T20:23:12.0916347Z physical id : 0 2025-05-07T20:23:12.0916558Z siblings : 16 2025-05-07T20:23:12.0916763Z core id : 3 2025-05-07T20:23:12.0916963Z cpu cores : 8 2025-05-07T20:23:12.0917166Z apicid : 6 2025-05-07T20:23:12.0917369Z initial apicid : 6 2025-05-07T20:23:12.0917583Z fpu : yes 2025-05-07T20:23:12.0917803Z fpu_exception : yes 2025-05-07T20:23:12.0918022Z cpuid level : 13 2025-05-07T20:23:12.0918231Z wp : yes 2025-05-07T20:23:12.0920191Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb 
sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.0922550Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.0923058Z bogomips : 5600.00 2025-05-07T20:23:12.0923276Z TLB size : 3072 4K pages 2025-05-07T20:23:12.0923517Z clflush size : 64 2025-05-07T20:23:12.0923735Z cache_alignment : 64 2025-05-07T20:23:12.0924009Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.0924321Z power management: 2025-05-07T20:23:12.0924461Z 2025-05-07T20:23:12.0924546Z processor : 4 2025-05-07T20:23:12.0924761Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.0924996Z cpu family : 23 2025-05-07T20:23:12.0925204Z model : 49 2025-05-07T20:23:12.0925420Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.0925663Z stepping : 0 2025-05-07T20:23:12.0925876Z microcode : 0x830107f 2025-05-07T20:23:12.0926108Z cpu MHz : 3188.425 2025-05-07T20:23:12.0926311Z cache size : 512 KB 2025-05-07T20:23:12.0926522Z physical id : 0 2025-05-07T20:23:12.0926799Z siblings : 16 2025-05-07T20:23:12.0941343Z core id : 4 2025-05-07T20:23:12.0941676Z cpu cores : 8 2025-05-07T20:23:12.0941965Z apicid : 8 2025-05-07T20:23:12.0942328Z initial apicid : 8 2025-05-07T20:23:12.0942554Z fpu : yes 2025-05-07T20:23:12.0942806Z fpu_exception : yes 2025-05-07T20:23:12.0943035Z cpuid level : 13 2025-05-07T20:23:12.0943249Z wp : yes 2025-05-07T20:23:12.0945291Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.0947510Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.0947994Z bogomips : 5600.00 2025-05-07T20:23:12.0948230Z TLB size : 3072 4K pages 2025-05-07T20:23:12.0948472Z clflush size : 64 2025-05-07T20:23:12.0948685Z cache_alignment : 64 2025-05-07T20:23:12.0948963Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.0949285Z power management: 2025-05-07T20:23:12.0949425Z 2025-05-07T20:23:12.0949518Z processor : 5 2025-05-07T20:23:12.0949744Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.0949988Z cpu family : 23 2025-05-07T20:23:12.0950196Z model : 49 2025-05-07T20:23:12.0950411Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.0950667Z stepping : 0 2025-05-07T20:23:12.0950872Z microcode : 0x830107f 2025-05-07T20:23:12.0951100Z cpu MHz : 3261.166 2025-05-07T20:23:12.0951320Z cache size : 512 KB 2025-05-07T20:23:12.0951533Z physical id : 0 2025-05-07T20:23:12.0951750Z siblings : 16 2025-05-07T20:23:12.0951956Z core id : 5 2025-05-07T20:23:12.0952155Z cpu cores : 8 2025-05-07T20:23:12.0952362Z apicid : 10 2025-05-07T20:23:12.0952572Z initial apicid : 10 2025-05-07T20:23:12.0952784Z fpu : yes 2025-05-07T20:23:12.0952995Z fpu_exception : yes 2025-05-07T20:23:12.0953215Z cpuid level : 13 2025-05-07T20:23:12.0953427Z wp : yes 2025-05-07T20:23:12.0955361Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx 
mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.0957588Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.0958077Z bogomips : 5600.00 2025-05-07T20:23:12.0958300Z TLB size : 3072 4K pages 2025-05-07T20:23:12.0958532Z clflush size : 64 2025-05-07T20:23:12.0958761Z cache_alignment : 64 2025-05-07T20:23:12.0959037Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.0959347Z power management: 2025-05-07T20:23:12.0959487Z 2025-05-07T20:23:12.0959572Z processor : 6 2025-05-07T20:23:12.0959792Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.0960028Z cpu family : 23 2025-05-07T20:23:12.0960239Z model : 49 2025-05-07T20:23:12.0960450Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.0960704Z stepping : 0 2025-05-07T20:23:12.0960911Z microcode : 0x830107f 2025-05-07T20:23:12.0961149Z cpu MHz : 3198.001 2025-05-07T20:23:12.0961368Z cache size : 512 KB 2025-05-07T20:23:12.0961585Z physical id : 0 2025-05-07T20:23:12.0961800Z siblings : 16 2025-05-07T20:23:12.0962007Z core id : 6 2025-05-07T20:23:12.0962212Z cpu cores : 8 2025-05-07T20:23:12.0962423Z apicid : 12 2025-05-07T20:23:12.0962635Z initial apicid : 12 2025-05-07T20:23:12.0962845Z fpu : yes 2025-05-07T20:23:12.0963059Z fpu_exception : yes 2025-05-07T20:23:12.0963308Z cpuid level : 13 2025-05-07T20:23:12.0963630Z wp : yes 2025-05-07T20:23:12.0965668Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.0968000Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.0968492Z bogomips : 5600.00 2025-05-07T20:23:12.0968712Z TLB size : 3072 4K pages 2025-05-07T20:23:12.0968954Z clflush size : 64 2025-05-07T20:23:12.0969173Z cache_alignment : 64 2025-05-07T20:23:12.0969447Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.0969750Z power management: 2025-05-07T20:23:12.0969888Z 2025-05-07T20:23:12.0969975Z processor : 7 2025-05-07T20:23:12.0970199Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.0970432Z cpu family : 23 2025-05-07T20:23:12.0970643Z model : 49 2025-05-07T20:23:12.0970853Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.0971091Z stepping : 0 2025-05-07T20:23:12.0971303Z microcode : 0x830107f 2025-05-07T20:23:12.0971532Z cpu MHz : 3316.480 2025-05-07T20:23:12.0971747Z cache size : 512 KB 2025-05-07T20:23:12.0971975Z physical id : 0 2025-05-07T20:23:12.0972191Z siblings : 16 2025-05-07T20:23:12.0972392Z core id : 7 2025-05-07T20:23:12.0972601Z cpu cores : 8 2025-05-07T20:23:12.0972809Z apicid : 
14 2025-05-07T20:23:12.0973009Z initial apicid : 14 2025-05-07T20:23:12.0973225Z fpu : yes 2025-05-07T20:23:12.0973429Z fpu_exception : yes 2025-05-07T20:23:12.0973643Z cpuid level : 13 2025-05-07T20:23:12.0973857Z wp : yes 2025-05-07T20:23:12.0975813Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.0978044Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.0978536Z bogomips : 5600.00 2025-05-07T20:23:12.0978755Z TLB size : 3072 4K pages 2025-05-07T20:23:12.0978997Z clflush size : 64 2025-05-07T20:23:12.0979216Z cache_alignment : 64 2025-05-07T20:23:12.0979484Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.0979807Z power management: 2025-05-07T20:23:12.0979938Z 2025-05-07T20:23:12.0980034Z processor : 8 2025-05-07T20:23:12.0980251Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.0980491Z cpu family : 23 2025-05-07T20:23:12.0980703Z model : 49 2025-05-07T20:23:12.0980902Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.0981150Z stepping : 0 2025-05-07T20:23:12.0981366Z microcode : 0x830107f 2025-05-07T20:23:12.0981588Z cpu MHz : 2823.868 2025-05-07T20:23:12.0981813Z cache size : 512 KB 2025-05-07T20:23:12.0982034Z physical id : 0 2025-05-07T20:23:12.0982247Z siblings : 16 2025-05-07T20:23:12.0982452Z core id : 0 2025-05-07T20:23:12.0982650Z cpu cores : 8 2025-05-07T20:23:12.0982850Z apicid : 1 2025-05-07T20:23:12.0983036Z initial apicid : 1 2025-05-07T20:23:12.0983240Z fpu : yes 2025-05-07T20:23:12.0983431Z fpu_exception : yes 2025-05-07T20:23:12.0983637Z cpuid level : 13 2025-05-07T20:23:12.0983837Z wp : yes 2025-05-07T20:23:12.0985769Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.0988297Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.0988775Z bogomips : 5600.00 2025-05-07T20:23:12.0988988Z TLB size : 3072 4K pages 2025-05-07T20:23:12.0989215Z clflush size : 64 2025-05-07T20:23:12.0989426Z cache_alignment : 64 2025-05-07T20:23:12.0989684Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.0989998Z power management: 2025-05-07T20:23:12.0990128Z 2025-05-07T20:23:12.0990224Z processor : 9 2025-05-07T20:23:12.0990430Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.0990660Z cpu family : 23 2025-05-07T20:23:12.0990868Z model : 49 2025-05-07T20:23:12.0991066Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.0991301Z 
stepping : 0 2025-05-07T20:23:12.0991503Z microcode : 0x830107f 2025-05-07T20:23:12.0991715Z cpu MHz : 3308.647 2025-05-07T20:23:12.0991928Z cache size : 512 KB 2025-05-07T20:23:12.0992143Z physical id : 0 2025-05-07T20:23:12.0992348Z siblings : 16 2025-05-07T20:23:12.0992598Z core id : 1 2025-05-07T20:23:12.0992806Z cpu cores : 8 2025-05-07T20:23:12.0993028Z apicid : 3 2025-05-07T20:23:12.0993251Z initial apicid : 3 2025-05-07T20:23:12.0993464Z fpu : yes 2025-05-07T20:23:12.0993658Z fpu_exception : yes 2025-05-07T20:23:12.0993876Z cpuid level : 13 2025-05-07T20:23:12.0994085Z wp : yes 2025-05-07T20:23:12.0996028Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.0998293Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.0998773Z bogomips : 5600.00 2025-05-07T20:23:12.0998997Z TLB size : 3072 4K pages 2025-05-07T20:23:12.0999231Z clflush size : 64 2025-05-07T20:23:12.0999444Z cache_alignment : 64 2025-05-07T20:23:12.0999712Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.1000024Z power management: 2025-05-07T20:23:12.1000155Z 2025-05-07T20:23:12.1000239Z processor : 10 2025-05-07T20:23:12.1000456Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.1000702Z cpu family : 23 2025-05-07T20:23:12.1000901Z model : 49 2025-05-07T20:23:12.1001109Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.1001355Z stepping : 0 2025-05-07T20:23:12.1001556Z microcode : 0x830107f 2025-05-07T20:23:12.1001781Z cpu MHz : 2905.656 2025-05-07T20:23:12.1001995Z cache size : 512 KB 2025-05-07T20:23:12.1002217Z physical id : 0 2025-05-07T20:23:12.1002420Z siblings : 16 2025-05-07T20:23:12.1002624Z core id : 2 2025-05-07T20:23:12.1002828Z cpu cores : 8 2025-05-07T20:23:12.1003031Z apicid : 5 2025-05-07T20:23:12.1003237Z initial apicid : 5 2025-05-07T20:23:12.1003448Z fpu : yes 2025-05-07T20:23:12.1003645Z fpu_exception : yes 2025-05-07T20:23:12.1003865Z cpuid level : 13 2025-05-07T20:23:12.1004075Z wp : yes 2025-05-07T20:23:12.1006871Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.1009627Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.1010316Z bogomips : 5600.00 2025-05-07T20:23:12.1010790Z TLB size : 3072 4K pages 2025-05-07T20:23:12.1011113Z clflush size : 64 2025-05-07T20:23:12.1011353Z cache_alignment : 64 2025-05-07T20:23:12.1011620Z address sizes : 48 bits 
physical, 48 bits virtual 2025-05-07T20:23:12.1011938Z power management: 2025-05-07T20:23:12.1012069Z 2025-05-07T20:23:12.1012152Z processor : 11 2025-05-07T20:23:12.1012372Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.1012611Z cpu family : 23 2025-05-07T20:23:12.1012811Z model : 49 2025-05-07T20:23:12.1013046Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.1013321Z stepping : 0 2025-05-07T20:23:12.1013524Z microcode : 0x830107f 2025-05-07T20:23:12.1013750Z cpu MHz : 3294.868 2025-05-07T20:23:12.1013965Z cache size : 512 KB 2025-05-07T20:23:12.1014174Z physical id : 0 2025-05-07T20:23:12.1014383Z siblings : 16 2025-05-07T20:23:12.1014585Z core id : 3 2025-05-07T20:23:12.1014781Z cpu cores : 8 2025-05-07T20:23:12.1014982Z apicid : 7 2025-05-07T20:23:12.1015181Z initial apicid : 7 2025-05-07T20:23:12.1015395Z fpu : yes 2025-05-07T20:23:12.1015592Z fpu_exception : yes 2025-05-07T20:23:12.1015810Z cpuid level : 13 2025-05-07T20:23:12.1016011Z wp : yes 2025-05-07T20:23:12.1017954Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.1020181Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.1020862Z bogomips : 5600.00 2025-05-07T20:23:12.1021161Z TLB size : 3072 4K pages 2025-05-07T20:23:12.1021467Z clflush size : 64 2025-05-07T20:23:12.1021681Z cache_alignment : 64 2025-05-07T20:23:12.1021951Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.1022315Z power management: 2025-05-07T20:23:12.1022509Z 2025-05-07T20:23:12.1022626Z processor : 12 2025-05-07T20:23:12.1022924Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.1023239Z cpu family : 23 2025-05-07T20:23:12.1023522Z model : 49 2025-05-07T20:23:12.1023733Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.1023976Z stepping : 0 2025-05-07T20:23:12.1024187Z microcode : 0x830107f 2025-05-07T20:23:12.1024411Z cpu MHz : 2057.325 2025-05-07T20:23:12.1024621Z cache size : 512 KB 2025-05-07T20:23:12.1024836Z physical id : 0 2025-05-07T20:23:12.1025046Z siblings : 16 2025-05-07T20:23:12.1025243Z core id : 4 2025-05-07T20:23:12.1025443Z cpu cores : 8 2025-05-07T20:23:12.1025645Z apicid : 9 2025-05-07T20:23:12.1025838Z initial apicid : 9 2025-05-07T20:23:12.1026044Z fpu : yes 2025-05-07T20:23:12.1026239Z fpu_exception : yes 2025-05-07T20:23:12.1026454Z cpuid level : 13 2025-05-07T20:23:12.1026662Z wp : yes 2025-05-07T20:23:12.1028607Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 
2025-05-07T20:23:12.1030937Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.1031420Z bogomips : 5600.00 2025-05-07T20:23:12.1031634Z TLB size : 3072 4K pages 2025-05-07T20:23:12.1031873Z clflush size : 64 2025-05-07T20:23:12.1032088Z cache_alignment : 64 2025-05-07T20:23:12.1032430Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.1032858Z power management: 2025-05-07T20:23:12.1033040Z 2025-05-07T20:23:12.1033164Z processor : 13 2025-05-07T20:23:12.1033449Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.1033763Z cpu family : 23 2025-05-07T20:23:12.1033972Z model : 49 2025-05-07T20:23:12.1034172Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.1034415Z stepping : 0 2025-05-07T20:23:12.1034677Z microcode : 0x830107f 2025-05-07T20:23:12.1035001Z cpu MHz : 3281.511 2025-05-07T20:23:12.1035286Z cache size : 512 KB 2025-05-07T20:23:12.1035577Z physical id : 0 2025-05-07T20:23:12.1035820Z siblings : 16 2025-05-07T20:23:12.1036014Z core id : 5 2025-05-07T20:23:12.1036215Z cpu cores : 8 2025-05-07T20:23:12.1036418Z apicid : 11 2025-05-07T20:23:12.1036619Z initial apicid : 11 2025-05-07T20:23:12.1036836Z fpu : yes 2025-05-07T20:23:12.1037037Z fpu_exception : yes 2025-05-07T20:23:12.1037247Z cpuid level : 13 2025-05-07T20:23:12.1037455Z wp : yes 2025-05-07T20:23:12.1039414Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.1041650Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.1042128Z bogomips : 5600.00 2025-05-07T20:23:12.1042348Z TLB size : 3072 4K pages 2025-05-07T20:23:12.1042584Z clflush size : 64 2025-05-07T20:23:12.1042797Z cache_alignment : 64 2025-05-07T20:23:12.1043093Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.1043438Z power management: 2025-05-07T20:23:12.1043568Z 2025-05-07T20:23:12.1043659Z processor : 14 2025-05-07T20:23:12.1043875Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.1044115Z cpu family : 23 2025-05-07T20:23:12.1044320Z model : 49 2025-05-07T20:23:12.1044521Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.1044764Z stepping : 0 2025-05-07T20:23:12.1044970Z microcode : 0x830107f 2025-05-07T20:23:12.1045191Z cpu MHz : 3297.405 2025-05-07T20:23:12.1045414Z cache size : 512 KB 2025-05-07T20:23:12.1045632Z physical id : 0 2025-05-07T20:23:12.1045836Z siblings : 16 2025-05-07T20:23:12.1046036Z core id : 6 2025-05-07T20:23:12.1046237Z cpu cores : 8 2025-05-07T20:23:12.1046436Z apicid : 13 2025-05-07T20:23:12.1046638Z initial apicid : 13 2025-05-07T20:23:12.1046857Z fpu : yes 2025-05-07T20:23:12.1047125Z fpu_exception : yes 2025-05-07T20:23:12.1047427Z cpuid level : 13 2025-05-07T20:23:12.1047794Z wp : yes 2025-05-07T20:23:12.1050148Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid 
extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.1052525Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.1053027Z bogomips : 5600.00 2025-05-07T20:23:12.1053278Z TLB size : 3072 4K pages 2025-05-07T20:23:12.1053509Z clflush size : 64 2025-05-07T20:23:12.1053718Z cache_alignment : 64 2025-05-07T20:23:12.1053985Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.1054300Z power management: 2025-05-07T20:23:12.1054431Z 2025-05-07T20:23:12.1054607Z processor : 15 2025-05-07T20:23:12.1054826Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.1055065Z cpu family : 23 2025-05-07T20:23:12.1055269Z model : 49 2025-05-07T20:23:12.1055476Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.1055718Z stepping : 0 2025-05-07T20:23:12.1055922Z microcode : 0x830107f 2025-05-07T20:23:12.1056147Z cpu MHz : 3249.562 2025-05-07T20:23:12.1056363Z cache size : 512 KB 2025-05-07T20:23:12.1056575Z physical id : 0 2025-05-07T20:23:12.1056793Z siblings : 16 2025-05-07T20:23:12.1056999Z core id : 7 2025-05-07T20:23:12.1057205Z cpu cores : 8 2025-05-07T20:23:12.1057404Z apicid : 15 2025-05-07T20:23:12.1057610Z initial apicid : 15 2025-05-07T20:23:12.1057826Z fpu : yes 2025-05-07T20:23:12.1058019Z fpu_exception : yes 2025-05-07T20:23:12.1058240Z cpuid level : 13 2025-05-07T20:23:12.1058452Z wp : yes 2025-05-07T20:23:12.1060430Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.1063335Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.1064020Z bogomips : 5600.00 2025-05-07T20:23:12.1064319Z TLB size : 3072 4K pages 2025-05-07T20:23:12.1064594Z clflush size : 64 2025-05-07T20:23:12.1064810Z cache_alignment : 64 2025-05-07T20:23:12.1065087Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.1065400Z power management: 2025-05-07T20:23:12.1065542Z 2025-05-07T20:23:12.1065546Z 2025-05-07T20:23:12.1065670Z ################################################################################ 2025-05-07T20:23:12.1065987Z [INFO] Print PCI info ... 2025-05-07T20:23:12.1066234Z + lspci -v 2025-05-07T20:23:12.1066350Z 2025-05-07T20:23:12.1066584Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:12.1066965Z Subsystem: Amazon.com, Inc. 
Device 1237 2025-05-07T20:23:12.1067287Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:12.1067495Z 2025-05-07T20:23:12.1067703Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:12.1068088Z Physical Slot: 1 2025-05-07T20:23:12.1068323Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.1068535Z 2025-05-07T20:23:12.1068782Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:12.1069216Z Physical Slot: 1 2025-05-07T20:23:12.1069469Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:12.1069700Z 2025-05-07T20:23:12.1069966Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:12.1070414Z Physical Slot: 3 2025-05-07T20:23:12.1070661Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.1071009Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:12.1071369Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:12.1071592Z 2025-05-07T20:23:12.1071894Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:12.1072507Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:12.1072800Z Physical Slot: 4 2025-05-07T20:23:12.1073059Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:12.1073437Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.1073793Z Capabilities: <access denied> 2025-05-07T20:23:12.1074064Z Kernel driver in use: nvme 2025-05-07T20:23:12.1074229Z 2025-05-07T20:23:12.1074567Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:12.1075043Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:12.1075391Z Physical Slot: 5 2025-05-07T20:23:12.1075637Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.1075990Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.1076383Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:12.1076716Z Capabilities: <access denied> 2025-05-07T20:23:12.1076985Z Kernel driver in use: ena 2025-05-07T20:23:12.1077234Z Kernel modules: ena 2025-05-07T20:23:12.1077374Z 2025-05-07T20:23:12.1077551Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:12.1077934Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:12.1078227Z Physical Slot: 30 2025-05-07T20:23:12.1078491Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:12.1078872Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:12.1079307Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:12.1079835Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:12.1080297Z Capabilities: <access denied> 2025-05-07T20:23:12.1080655Z Kernel driver in use: nvidia 2025-05-07T20:23:12.1080958Z Kernel modules: nvidia 2025-05-07T20:23:12.1081105Z 2025-05-07T20:23:12.1081416Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:12.1081935Z Subsystem: Amazon.com, Inc.
Device 0000 2025-05-07T20:23:12.1082226Z Physical Slot: 31 2025-05-07T20:23:12.1082473Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.1082835Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.1083212Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:12.1083547Z Capabilities: <access denied> 2025-05-07T20:23:12.1083817Z Kernel driver in use: nvme 2025-05-07T20:23:12.1083979Z 2025-05-07T20:23:12.1083983Z 2025-05-07T20:23:12.1084106Z ################################################################################ 2025-05-07T20:23:12.1084437Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:12.1084729Z + uname -a 2025-05-07T20:23:12.1084842Z 2025-05-07T20:23:12.1085260Z Linux ip-10-0-29-135.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:12.1085761Z 2025-05-07T20:23:12.1085845Z + uname -m 2025-05-07T20:23:12.1085969Z 2025-05-07T20:23:12.1086046Z x86_64 2025-05-07T20:23:12.1086154Z 2025-05-07T20:23:12.1086256Z + cat /proc/version 2025-05-07T20:23:12.1086390Z 2025-05-07T20:23:12.1086943Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:12.1087663Z 2025-05-07T20:23:12.1087751Z + cat /etc/os-release 2025-05-07T20:23:12.1087903Z 2025-05-07T20:23:12.1087995Z NAME="Amazon Linux" 2025-05-07T20:23:12.1088217Z VERSION="2023" 2025-05-07T20:23:12.1088421Z ID="amzn" 2025-05-07T20:23:12.1088623Z ID_LIKE="fedora" 2025-05-07T20:23:12.1096510Z VERSION_ID="2023" 2025-05-07T20:23:12.1096835Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:12.1097137Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:12.1097433Z ANSI_COLOR="0;33" 2025-05-07T20:23:12.1097689Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:12.1098225Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:12.1098667Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:12.1099086Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:12.1099536Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:12.1099918Z VENDOR_NAME="AWS" 2025-05-07T20:23:12.1100173Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:12.1100463Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:12.1100629Z 2025-05-07T20:23:12.1100844Z ################################################################################ 2025-05-07T20:23:12.1101165Z # Print EC2 Instance Info 2025-05-07T20:23:12.1101406Z # 2025-05-07T20:23:12.1101635Z # [2025-05-07T20:23:12.107Z] + print_ec2_info 2025-05-07T20:23:12.1101968Z ################################################################################ 2025-05-07T20:23:12.1102186Z 2025-05-07T20:23:12.1242355Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:12.1363555Z instance-id: i-0e49e9d70b38203df 2025-05-07T20:23:12.1481075Z instance-type: g5.4xlarge 2025-05-07T20:23:12.1532700Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:12.1533289Z .
$PRELUDE; print_gpu_info 2025-05-07T20:23:12.1546618Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:12.1547148Z env: 2025-05-07T20:23:12.1547475Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:12.1547946Z BUILD_ENV: build_binary 2025-05-07T20:23:12.1548340Z BUILD_TARGET: genai 2025-05-07T20:23:12.1548691Z BUILD_VARIANT: cuda 2025-05-07T20:23:12.1549060Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:12.1549464Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:12.1549938Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:12.1550450Z ##[endgroup] 2025-05-07T20:23:12.4863267Z ################################################################################ 2025-05-07T20:23:12.4863689Z [INFO] Printing general display info ... 2025-05-07T20:23:12.4890705Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:12.5993397Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:12.6001480Z /usr/bin/sudo 2025-05-07T20:23:12.6013315Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:12.6022707Z /usr/bin/yum 2025-05-07T20:23:12.6024289Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:12.6045233Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:13.0669826Z Last metadata expiration check: 0:00:09 ago on Wed May 7 20:23:04 2025. 2025-05-07T20:23:13.1439103Z ================================================================================ 2025-05-07T20:23:13.1439602Z WARNING: 2025-05-07T20:23:13.1439960Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:13.1440290Z 2025-05-07T20:23:13.1440414Z Available Versions: 2025-05-07T20:23:13.1440620Z 2025-05-07T20:23:13.1440742Z Version 2023.7.20250331: 2025-05-07T20:23:13.1441128Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:13.1441397Z 2025-05-07T20:23:13.1441534Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:13.1441743Z 2025-05-07T20:23:13.1441828Z Release notes: 2025-05-07T20:23:13.1442236Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:13.1442604Z 2025-05-07T20:23:13.1442701Z Version 2023.7.20250414: 2025-05-07T20:23:13.1443015Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:13.1443262Z 2025-05-07T20:23:13.1443401Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:13.1443636Z 2025-05-07T20:23:13.1443723Z Release notes: 2025-05-07T20:23:13.1444118Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:13.1444479Z 2025-05-07T20:23:13.1444568Z Version 2023.7.20250428: 2025-05-07T20:23:13.1444880Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:13.1445339Z 2025-05-07T20:23:13.1445451Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:13.1445660Z 2025-05-07T20:23:13.1445753Z Release notes: 2025-05-07T20:23:13.1446139Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:13.1446503Z 2025-05-07T20:23:13.1446615Z ================================================================================ 2025-05-07T20:23:13.2584203Z Dependencies resolved. 
2025-05-07T20:23:13.2871738Z ================================================================================ 2025-05-07T20:23:13.2872753Z Package Arch Version Repository Size 2025-05-07T20:23:13.2873683Z ================================================================================ 2025-05-07T20:23:13.2874063Z Upgrading: 2025-05-07T20:23:13.2874478Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:13.2875289Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:13.2875794Z 2025-05-07T20:23:13.2876181Z Transaction Summary 2025-05-07T20:23:13.2876511Z ================================================================================ 2025-05-07T20:23:13.2876812Z Upgrade 2 Packages 2025-05-07T20:23:13.2876953Z 2025-05-07T20:23:13.2877061Z Total download size: 6.9 M 2025-05-07T20:23:13.2877320Z Downloading Packages: 2025-05-07T20:23:13.3248931Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 34 MB/s | 1.2 MB 00:00 2025-05-07T20:23:13.3669582Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 73 MB/s | 5.7 MB 00:00 2025-05-07T20:23:13.3679956Z -------------------------------------------------------------------------------- 2025-05-07T20:23:13.3681197Z Total 86 MB/s | 6.9 MB 00:00 2025-05-07T20:23:13.3684335Z Running transaction check 2025-05-07T20:23:13.3783525Z Transaction check succeeded. 2025-05-07T20:23:13.3784155Z Running transaction test 2025-05-07T20:23:13.4079515Z Transaction test succeeded. 2025-05-07T20:23:13.4082324Z Running transaction 2025-05-07T20:23:13.9621198Z Preparing : 1/1 2025-05-07T20:23:14.0676000Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:14.0696256Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:14.0930817Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:14.0931583Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:14.1033532Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:14.1054466Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:14.2831516Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:14.2832094Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:14.2832658Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:14.2833189Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4
2025-05-07T20:23:14.4842873Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:14.4843220Z 2025-05-07T20:23:14.4843315Z Upgraded: 2025-05-07T20:23:14.4843667Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:14.4844280Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:14.4844613Z 2025-05-07T20:23:14.4844695Z Complete! 2025-05-07T20:23:14.5279559Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:14.5303893Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:15.0104356Z Last metadata expiration check: 0:00:10 ago on Wed May 7 20:23:04 2025. 2025-05-07T20:23:15.0345531Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:15.0743317Z Dependencies resolved.
2025-05-07T20:23:15.0920056Z ================================================================================ 2025-05-07T20:23:15.0920557Z Package Architecture Version Repository Size 2025-05-07T20:23:15.0920975Z ================================================================================ 2025-05-07T20:23:15.0921277Z Installing: 2025-05-07T20:23:15.0921568Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k 2025-05-07T20:23:15.0921833Z 2025-05-07T20:23:15.0921930Z Transaction Summary 2025-05-07T20:23:15.0922174Z ================================================================================ 2025-05-07T20:23:15.0922480Z Install 1 Package 2025-05-07T20:23:15.0922612Z 2025-05-07T20:23:15.0922938Z Total download size: 319 k 2025-05-07T20:23:15.0923221Z Installed size: 837 k 2025-05-07T20:23:15.0924499Z Downloading Packages: 2025-05-07T20:23:15.1706194Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 6.5 MB/s | 319 kB 00:00 2025-05-07T20:23:15.1711966Z -------------------------------------------------------------------------------- 2025-05-07T20:23:15.1714712Z Total 4.0 MB/s | 319 kB 00:00 2025-05-07T20:23:15.1869355Z Running transaction check 2025-05-07T20:23:15.1924236Z Transaction check succeeded. 2025-05-07T20:23:15.1924851Z Running transaction test 2025-05-07T20:23:15.2381871Z Transaction test succeeded. 2025-05-07T20:23:15.2385470Z Running transaction 2025-05-07T20:23:15.3404077Z Preparing : 1/1 2025-05-07T20:23:15.3909663Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:15.5389247Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1
2025-05-07T20:23:15.6986560Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:15.6986913Z 2025-05-07T20:23:15.6987004Z Installed: 2025-05-07T20:23:15.6987320Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64 2025-05-07T20:23:15.6987610Z 2025-05-07T20:23:15.6987704Z Complete! 2025-05-07T20:23:15.7428565Z + hostname 2025-05-07T20:23:15.7428726Z 2025-05-07T20:23:15.7442135Z ip-10-0-29-135.ec2.internal 2025-05-07T20:23:15.7443083Z 2025-05-07T20:23:15.7443568Z + sudo lshw -C display 2025-05-07T20:23:15.7443780Z 2025-05-07T20:23:16.3041306Z *-display:0 UNCLAIMED 2025-05-07T20:23:16.3041662Z description: VGA compatible controller 2025-05-07T20:23:16.3042000Z product: Amazon.com, Inc. 2025-05-07T20:23:16.3042285Z vendor: Amazon.com, Inc.
2025-05-07T20:23:16.3042551Z physical id: 3 2025-05-07T20:23:16.3042790Z bus info: pci@0000:00:03.0 2025-05-07T20:23:16.3043055Z version: 00 2025-05-07T20:23:16.3043278Z width: 32 bits 2025-05-07T20:23:16.3043503Z clock: 33MHz 2025-05-07T20:23:16.3043760Z capabilities: vga_controller bus_master 2025-05-07T20:23:16.3044082Z configuration: latency=0 2025-05-07T20:23:16.3044410Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:16.3044739Z *-display:1 2025-05-07T20:23:16.3044997Z description: 3D controller 2025-05-07T20:23:16.3045290Z product: GA102GL [A10G] 2025-05-07T20:23:16.3045554Z vendor: NVIDIA Corporation 2025-05-07T20:23:16.3045823Z physical id: 1e 2025-05-07T20:23:16.3046072Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:16.3046324Z version: a1 2025-05-07T20:23:16.3046544Z width: 64 bits 2025-05-07T20:23:16.3046765Z clock: 33MHz 2025-05-07T20:23:16.3047049Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:16.3047420Z configuration: driver=nvidia latency=0 2025-05-07T20:23:16.3048135Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:16.3079330Z 2025-05-07T20:23:16.3079798Z ################################################################################ 2025-05-07T20:23:16.3208016Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:16.3208398Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:16.3376081Z Wed May 7 20:23:16 2025 2025-05-07T20:23:16.3376830Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.3377788Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:16.3378732Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:16.3379690Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:16.3380710Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:16.3381531Z | | | MIG M. | 2025-05-07T20:23:16.3382180Z |=========================================+========================+======================| 2025-05-07T20:23:16.3454812Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:16.3455468Z | 0% 30C P0 57W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:16.3455860Z | | | N/A | 2025-05-07T20:23:16.3456256Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:16.3456648Z 2025-05-07T20:23:16.3457044Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.3457461Z | Processes: | 2025-05-07T20:23:16.3457896Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:16.3458309Z | ID ID Usage | 2025-05-07T20:23:16.3458661Z |=========================================================================================| 2025-05-07T20:23:16.3459575Z | No running processes found | 2025-05-07T20:23:16.3460039Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.4850021Z ################################################################################ 2025-05-07T20:23:16.4850378Z [INFO] Printing AMD GPU info ... 
2025-05-07T20:23:16.4991811Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:16.4992459Z [CHECK] rocminfo not found 2025-05-07T20:23:16.5000290Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:16.5002466Z [CHECK] rocm-smi not found 2025-05-07T20:23:16.5066651Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:16.5067080Z . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:16.5078939Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:16.5079287Z env: 2025-05-07T20:23:16.5079518Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:16.5079820Z BUILD_ENV: build_binary 2025-05-07T20:23:16.5080070Z BUILD_TARGET: genai 2025-05-07T20:23:16.5080304Z BUILD_VARIANT: cuda 2025-05-07T20:23:16.5080537Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:16.5080798Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:16.5081099Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:16.5081434Z ##[endgroup] 2025-05-07T20:23:16.8402599Z ################################################################################ 2025-05-07T20:23:16.8402976Z # Setup Miniconda 2025-05-07T20:23:16.8403258Z # 2025-05-07T20:23:16.8417539Z # [2025-05-07T20:23:16.841Z] + setup_miniconda /home/ec2-user/miniconda 2025-05-07T20:23:16.8418102Z ################################################################################ 2025-05-07T20:23:16.8418405Z 2025-05-07T20:23:16.8431916Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:16.9398662Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:16.9399633Z + mkdir -p /home/ec2-user/miniconda 2025-05-07T20:23:16.9400176Z 2025-05-07T20:23:16.9416160Z 2025-05-07T20:23:16.9416664Z [SETUP] Downloading the Miniconda installer ... 2025-05-07T20:23:16.9437313Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh 2025-05-07T20:23:17.9807991Z [SETUP] Installing Miniconda ... 2025-05-07T20:23:17.9808351Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u 2025-05-07T20:23:17.9808604Z 2025-05-07T20:23:17.9952496Z PREFIX=/home/ec2-user/miniconda 2025-05-07T20:23:18.4401761Z Unpacking payload ... 2025-05-07T20:23:18.9576374Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:19.7539011Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:21.8549942Z 2025-05-07T20:23:21.8550398Z Installing base environment... 2025-05-07T20:23:21.8550634Z 2025-05-07T20:23:22.9340877Z Preparing transaction: ...working... done 2025-05-07T20:23:25.7917388Z Executing transaction: ...working... done 2025-05-07T20:23:26.4502269Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:26.5381877Z installation finished. 2025-05-07T20:23:26.5389507Z 2025-05-07T20:23:26.5389743Z + rm -f miniconda.sh 2025-05-07T20:23:26.5389980Z 2025-05-07T20:23:26.5697060Z 2025-05-07T20:23:26.5697608Z [SETUP] Reloading the bash configuration ... 
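The Miniconda bootstrap above reduces to a standard silent install; a rough equivalent, using the same installer URL and prefix that appear in the log:

    #!/usr/bin/env bash
    # Sketch of the silent Miniconda install performed above.
    PREFIX="$HOME/miniconda"

    wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
    bash miniconda.sh -b -p "$PREFIX" -u   # -b = batch (no prompts), -u = update an existing prefix
    rm -f miniconda.sh
    "$PREFIX/bin/conda" init bash          # appends the conda hook to ~/.bashrc
    . ~/.bashrc                            # reload so conda resolves in the current shell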
2025-05-07T20:23:26.5698093Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:23:26.9319175Z no change /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:23:26.9319563Z no change /home/ec2-user/miniconda/bin/conda
2025-05-07T20:23:26.9319980Z no change /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:23:26.9320425Z no change /home/ec2-user/miniconda/bin/activate
2025-05-07T20:23:26.9320783Z no change /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:23:26.9321175Z no change /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:23:26.9321596Z no change /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:23:26.9322037Z no change /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:23:26.9322867Z no change /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:23:26.9323650Z no change /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:23:26.9324165Z no change /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:23:26.9324534Z modified /home/ec2-user/.bashrc
2025-05-07T20:23:26.9324921Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:23:26.9973939Z + . /home/ec2-user/.bashrc
2025-05-07T20:23:27.8285916Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:23:27.8310963Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:23:41.7114854Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:23:43.2974971Z Solving environment: done
2025-05-07T20:23:43.4026801Z ## Package Plan ##
2025-05-07T20:23:43.4027615Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:23:43.4028533Z   added / updated specs:
2025-05-07T20:23:43.4029191Z     - conda-libmamba-solver
2025-05-07T20:23:43.4029717Z     - libarchive
2025-05-07T20:23:43.4030158Z     - libmamba
2025-05-07T20:23:43.4030574Z     - libmambapy
2025-05-07T20:23:43.4031120Z The following packages will be downloaded:
2025-05-07T20:23:43.4031791Z     package                     |            build
2025-05-07T20:23:43.4032287Z     ----------------------------|-----------------
2025-05-07T20:23:43.4032828Z     ca-certificates-2025.4.26   |       hbd8a1cb_0         149 KB  conda-forge
2025-05-07T20:23:43.4033305Z     certifi-2025.4.26           |     pyhd8ed1ab_0         154 KB  conda-forge
2025-05-07T20:23:43.4033739Z     conda-25.3.1                |  py313h78bf25f_1         1.1 MB  conda-forge
2025-05-07T20:23:43.4034371Z     conda-libmamba-solver-25.4.0|     pyhd8ed1ab_0          41 KB  conda-forge
2025-05-07T20:23:43.4035052Z     ------------------------------------------------------------
2025-05-07T20:23:43.4035557Z                                            Total:         1.4 MB
2025-05-07T20:23:43.4036051Z The following packages will be UPDATED:
2025-05-07T20:23:43.4041696Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:23:43.4042746Z   conda              pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:23:43.4043669Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:23:43.4044720Z   certifi            pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:23:43.4046046Z   conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:23:43.4047101Z Downloading and Extracting Packages: ...working...
2025-05-07T20:23:43.4724738Z ca-certificates-2025 | 149 KB    | ########## | 100%
2025-05-07T20:23:43.4765675Z conda-25.3.1         | 1.1 MB    | ########## | 100%
2025-05-07T20:23:43.5141106Z conda-libmamba-solve | 41 KB     | ########## | 100%
2025-05-07T20:23:43.5145864Z certifi-2025.4.26    | 154 KB    | ########## | 100%
2025-05-07T20:23:43.6166074Z done
2025-05-07T20:23:43.7169233Z Preparing transaction: done
2025-05-07T20:23:43.8174098Z Verifying transaction: done
2025-05-07T20:23:45.1194536Z Executing transaction: done
2025-05-07T20:23:46.8417605Z [SETUP] Updating Miniconda base packages ...
2025-05-07T20:23:46.8442273Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:23:47.7875839Z Channels:
2025-05-07T20:23:47.7876179Z  - defaults
2025-05-07T20:23:47.7876496Z Platform: linux-64
2025-05-07T20:23:49.0563550Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:49.1816038Z Solving environment: Channels:
2025-05-07T20:23:49.1816447Z  - defaults
2025-05-07T20:23:49.1816709Z Platform: linux-64
2025-05-07T20:23:49.4964432Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:49.7121593Z Solving environment: done
2025-05-07T20:23:49.7896818Z done
2025-05-07T20:23:49.8600871Z ## Package Plan ##
2025-05-07T20:23:49.8601278Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:23:49.8601757Z   added / updated specs:
2025-05-07T20:23:49.8602074Z     - conda
2025-05-07T20:23:49.8602374Z The following packages will be downloaded:
2025-05-07T20:23:49.8602704Z     package                    |            build
2025-05-07T20:23:49.8603019Z     ---------------------------|-----------------
2025-05-07T20:23:49.8603353Z     pip-25.1                   |     pyhc872135_2         1.3 MB
2025-05-07T20:23:49.8604157Z     tzdata-2025b               |       h04d1e81_0         116 KB
2025-05-07T20:23:49.8604684Z     ------------------------------------------------------------
2025-05-07T20:23:49.8605088Z                                            Total:         1.4 MB
2025-05-07T20:23:49.8605422Z The following packages will be UPDATED:
2025-05-07T20:23:49.8606229Z   pip       pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:23:49.8606737Z   tzdata                        2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:23:49.8607141Z Downloading and Extracting Packages: ...working...
2025-05-07T20:23:49.9127675Z tzdata-2025b         | 116 KB    | ########## | 100%
2025-05-07T20:23:50.1776608Z pip-25.1             | 1.3 MB    | ########## | 100%
2025-05-07T20:23:50.1917082Z done
2025-05-07T20:23:50.2922018Z Preparing transaction: done
2025-05-07T20:23:50.3927378Z Verifying transaction: done
2025-05-07T20:23:52.6955883Z Executing transaction: done
2025-05-07T20:23:53.3027933Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:23:53.3032365Z + conda clean --packages --tarball -y
2025-05-07T20:23:54.3103597Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:23:54.3104047Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:23:54.3739892Z + conda clean --all -y
2025-05-07T20:23:54.9054132Z There are no unused tarball(s) to remove.
2025-05-07T20:23:54.9054587Z Will remove 1 index cache(s).
2025-05-07T20:23:54.9054965Z There are no unused package(s) to remove.
2025-05-07T20:23:54.9055340Z There are no tempfile(s) to remove. 2025-05-07T20:23:54.9055638Z There are no logfile(s) to remove. 2025-05-07T20:23:54.9676698Z 2025-05-07T20:23:54.9681137Z + conda info 2025-05-07T20:23:54.9681332Z 2025-05-07T20:23:55.7161960Z 2025-05-07T20:23:55.7162703Z active environment : base 2025-05-07T20:23:55.7163230Z active env location : /home/ec2-user/miniconda 2025-05-07T20:23:55.7163629Z shell level : 1 2025-05-07T20:23:55.7163905Z user config file : /home/ec2-user/.condarc 2025-05-07T20:23:55.7164293Z populated config files : /home/ec2-user/miniconda/.condarc 2025-05-07T20:23:55.7164664Z conda version : 25.3.1 2025-05-07T20:23:55.7164941Z conda-build version : not installed 2025-05-07T20:23:55.7165246Z python version : 3.13.2.final.0 2025-05-07T20:23:55.7165544Z solver : libmamba (default) 2025-05-07T20:23:55.7165893Z virtual packages : __archspec=1=zen2 2025-05-07T20:23:55.7166189Z __conda=25.3.1=0 2025-05-07T20:23:55.7166463Z __cuda=12.8=0 2025-05-07T20:23:55.7166726Z __glibc=2.34=0 2025-05-07T20:23:55.7167001Z __linux=6.1.130=0 2025-05-07T20:23:55.7167270Z __unix=0=0 2025-05-07T20:23:55.7168021Z base environment : /home/ec2-user/miniconda (writable) 2025-05-07T20:23:55.7168436Z conda av data dir : /home/ec2-user/miniconda/etc/conda 2025-05-07T20:23:55.7168788Z conda av metadata url : None 2025-05-07T20:23:55.7169159Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64 2025-05-07T20:23:55.7169583Z https://repo.anaconda.com/pkgs/main/noarch 2025-05-07T20:23:55.7169964Z https://repo.anaconda.com/pkgs/r/linux-64 2025-05-07T20:23:55.7170342Z https://repo.anaconda.com/pkgs/r/noarch 2025-05-07T20:23:55.7170704Z package cache : /home/ec2-user/miniconda/pkgs 2025-05-07T20:23:55.7171040Z /home/ec2-user/.conda/pkgs 2025-05-07T20:23:55.7171383Z envs directories : /home/ec2-user/miniconda/envs 2025-05-07T20:23:55.7171713Z /home/ec2-user/.conda/envs 2025-05-07T20:23:55.7172017Z platform : linux-64 2025-05-07T20:23:55.7172852Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/. 2025-05-07T20:23:55.7173826Z UID:GID : 1000:1000 2025-05-07T20:23:55.7174097Z netrc file : None 2025-05-07T20:23:55.7174358Z offline mode : False 2025-05-07T20:23:55.7174523Z 2025-05-07T20:23:55.7814399Z 2025-05-07T20:23:55.7814749Z [SETUP] Exporting Miniconda variables ... 2025-05-07T20:23:55.7815930Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_fcaa5f2a-b6b0-49f7-919d-c15ce02ab8c2 ... 2025-05-07T20:23:55.7816713Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda 2025-05-07T20:23:55.7888715Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.11 2025-05-07T20:23:55.7889226Z . 
$PRELUDE; create_conda_environment $BUILD_ENV 3.11 2025-05-07T20:23:55.7908315Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:55.7908655Z env: 2025-05-07T20:23:55.7908876Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:55.7909178Z BUILD_ENV: build_binary 2025-05-07T20:23:55.7909438Z BUILD_TARGET: genai 2025-05-07T20:23:55.7909666Z BUILD_VARIANT: cuda 2025-05-07T20:23:55.7909891Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:55.7910143Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:55.7910437Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:55.7910759Z ##[endgroup] 2025-05-07T20:23:56.1272818Z ################################################################################ 2025-05-07T20:23:56.1273189Z # Create Conda Environment 2025-05-07T20:23:56.1273435Z # 2025-05-07T20:23:56.1289909Z # [2025-05-07T20:23:56.128Z] + create_conda_environment build_binary 3.11 2025-05-07T20:23:56.1290320Z ################################################################################ 2025-05-07T20:23:56.1290548Z 2025-05-07T20:23:56.1308025Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:56.2185934Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:56.2186305Z [SETUP] Listing existing Conda environments ... 2025-05-07T20:23:56.2186626Z + conda info --envs 2025-05-07T20:23:56.2186769Z 2025-05-07T20:23:56.9625536Z 2025-05-07T20:23:56.9626093Z # conda environments: 2025-05-07T20:23:56.9626368Z # 2025-05-07T20:23:56.9626592Z base /home/ec2-user/miniconda 2025-05-07T20:23:56.9626827Z 2025-05-07T20:23:57.0280252Z 2025-05-07T20:23:57.0281457Z [SETUP] Deleting the prefix directory if it exists ... 2025-05-07T20:23:58.6515675Z + rm -rf /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:23:58.6515942Z 2025-05-07T20:23:58.6529228Z 2025-05-07T20:23:58.6538384Z [SETUP] Creating new Conda environment (Python 3.11) ... 
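The sequence above (list environments, delete any stale prefix, then create) makes environment creation idempotent across job retries; the same pattern in isolation, with the names taken from this log:

    # Recreate-from-scratch pattern used by create_conda_environment above.
    ENV_NAME=build_binary
    conda info --envs                              # show what already exists
    rm -rf "$HOME/miniconda/envs/$ENV_NAME"        # drop any stale prefix first
    conda create -y -n "$ENV_NAME" python=3.11     # then create cleanly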
2025-05-07T20:23:58.6561181Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.11 2025-05-07T20:23:59.4094589Z Channels: 2025-05-07T20:23:59.4094842Z - defaults 2025-05-07T20:23:59.4095115Z Platform: linux-64 2025-05-07T20:24:00.9778419Z Collecting package metadata (repodata.json): - \ | / - \ | / - \ | done 2025-05-07T20:24:01.1018179Z Solving environment: - done 2025-05-07T20:24:01.1308823Z 2025-05-07T20:24:01.1309057Z ## Package Plan ## 2025-05-07T20:24:01.1309279Z 2025-05-07T20:24:01.1309577Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:01.1309978Z 2025-05-07T20:24:01.1310105Z added / updated specs: 2025-05-07T20:24:01.1310425Z - python=3.11 2025-05-07T20:24:01.1310557Z 2025-05-07T20:24:01.1310561Z 2025-05-07T20:24:01.1310689Z The following packages will be downloaded: 2025-05-07T20:24:01.1310904Z 2025-05-07T20:24:01.1311041Z package | build 2025-05-07T20:24:01.1311357Z ---------------------------|----------------- 2025-05-07T20:24:01.1311715Z _libgcc_mutex-0.1 | main 3 KB 2025-05-07T20:24:01.1312118Z _openmp_mutex-5.1 | 1_gnu 21 KB 2025-05-07T20:24:01.1312689Z ca-certificates-2025.2.25 | h06a4308_0 129 KB 2025-05-07T20:24:01.1313236Z python-3.11.11 | he870216_0 32.9 MB 2025-05-07T20:24:01.1314092Z setuptools-78.1.1 | py311h06a4308_0 2.3 MB 2025-05-07T20:24:01.1314490Z wheel-0.45.1 | py311h06a4308_0 151 KB 2025-05-07T20:24:01.1314850Z ------------------------------------------------------------ 2025-05-07T20:24:01.1315190Z Total: 35.4 MB 2025-05-07T20:24:01.1315395Z 2025-05-07T20:24:01.1315531Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:01.1315753Z 2025-05-07T20:24:01.1316148Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main 2025-05-07T20:24:01.1316602Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu 2025-05-07T20:24:01.1317015Z bzip2 pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6 2025-05-07T20:24:01.1317514Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0 2025-05-07T20:24:01.1318063Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0 2025-05-07T20:24:01.1318524Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1 2025-05-07T20:24:01.1318948Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1 2025-05-07T20:24:01.1319465Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1 2025-05-07T20:24:01.1320092Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1 2025-05-07T20:24:01.1320714Z libuuid pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0 2025-05-07T20:24:01.1321149Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0 2025-05-07T20:24:01.1321568Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0 2025-05-07T20:24:01.1321968Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2 2025-05-07T20:24:01.1322367Z python pkgs/main/linux-64::python-3.11.11-he870216_0 2025-05-07T20:24:01.1322800Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0 2025-05-07T20:24:01.1323298Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py311h06a4308_0 2025-05-07T20:24:01.1323833Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0 2025-05-07T20:24:01.1324224Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0 2025-05-07T20:24:01.1324607Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0 2025-05-07T20:24:01.1325016Z wheel pkgs/main/linux-64::wheel-0.45.1-py311h06a4308_0 2025-05-07T20:24:01.1325415Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1 2025-05-07T20:24:01.1325794Z zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1 2025-05-07T20:24:01.1326063Z 2025-05-07T20:24:01.1326070Z 
2025-05-07T20:24:01.1326282Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:01.1714642Z _openmp_mutex-5.1    | 21 KB     | ########## | 100%
2025-05-07T20:24:01.1901202Z wheel-0.45.1         | 151 KB    | ########## | 100%
2025-05-07T20:24:01.2193818Z _libgcc_mutex-0.1    | 3 KB      | ########## | 100%
2025-05-07T20:24:01.2316767Z ca-certificates-2025 | 129 KB    | ########## | 100%
2025-05-07T20:24:01.3122658Z setuptools-78.1.1    | 2.3 MB    | ########## | 100%
2025-05-07T20:24:01.6151752Z python-3.11.11       | 32.9 MB   | ########## | 100%
2025-05-07T20:24:02.2731535Z done
2025-05-07T20:24:02.4837329Z Preparing transaction: done
2025-05-07T20:24:03.8499766Z Verifying transaction: done
2025-05-07T20:24:06.1652336Z Executing transaction: done
2025-05-07T20:24:06.2161114Z #
2025-05-07T20:24:06.2161767Z # To activate this environment, use
2025-05-07T20:24:06.2163098Z #
2025-05-07T20:24:06.2163528Z #     $ conda activate build_binary
2025-05-07T20:24:06.2164043Z #
2025-05-07T20:24:06.2164459Z # To deactivate an active environment, use
2025-05-07T20:24:06.2164996Z #
2025-05-07T20:24:06.2165368Z #     $ conda deactivate
2025-05-07T20:24:06.3207698Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:06.3229034Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:09.2256834Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (25.1)
2025-05-07T20:24:09.2257463Z Collecting pip
2025-05-07T20:24:09.2257783Z Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:09.2258193Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:09.2259109Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 69.0 MB/s eta 0:00:00
2025-05-07T20:24:09.2259596Z Installing collected packages: pip
2025-05-07T20:24:09.2260016Z Attempting uninstall: pip
2025-05-07T20:24:09.2260419Z Found existing installation: pip 25.1
2025-05-07T20:24:09.2260840Z Uninstalling pip-25.1:
2025-05-07T20:24:09.2261227Z Successfully uninstalled pip-25.1
2025-05-07T20:24:09.2261675Z Successfully installed pip-25.1.1
2025-05-07T20:24:09.2905195Z [SETUP] Upgrading pyOpenSSL ...
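"conda run -n <env>" executes a command inside the environment without activating it, which is why the pip upgrade above lands in the env's own site-packages rather than in base; roughly:

    conda run -n build_binary pip install --upgrade pip
    conda run -n build_binary python -m pip --version   # should report 25.1.1 after the upgrade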
2025-05-07T20:24:09.2927636Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0 2025-05-07T20:24:10.1536035Z Channels: 2025-05-07T20:24:10.1536275Z - conda-forge 2025-05-07T20:24:10.1536554Z Platform: linux-64 2025-05-07T20:24:20.6405885Z Collecting package metadata (repodata.json): - \ | / - \ | / - \ | / - \ | / - \ done 2025-05-07T20:24:22.3183180Z Solving environment: / - \ | / done 2025-05-07T20:24:22.3791687Z 2025-05-07T20:24:22.3792172Z ## Package Plan ## 2025-05-07T20:24:22.3792388Z 2025-05-07T20:24:22.3792600Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:22.3792904Z 2025-05-07T20:24:22.3793016Z added / updated specs: 2025-05-07T20:24:22.3793288Z - pyopenssl[version='>22.1.0'] 2025-05-07T20:24:22.3793513Z 2025-05-07T20:24:22.3793518Z 2025-05-07T20:24:22.3793667Z The following packages will be downloaded: 2025-05-07T20:24:22.3793893Z 2025-05-07T20:24:22.3794014Z package | build 2025-05-07T20:24:22.3794353Z ---------------------------|----------------- 2025-05-07T20:24:22.3794712Z cffi-1.17.1 | py311hf29c0ef_0 295 KB conda-forge 2025-05-07T20:24:22.3795163Z cryptography-44.0.3 | py311hafd3f86_0 1.5 MB conda-forge 2025-05-07T20:24:22.3795612Z libgcc-15.1.0 | h767d61c_2 810 KB conda-forge 2025-05-07T20:24:22.3796022Z libgcc-ng-15.1.0 | h69a702a_2 34 KB conda-forge 2025-05-07T20:24:22.3796452Z libgomp-15.1.0 | h767d61c_2 442 KB conda-forge 2025-05-07T20:24:22.3796868Z openssl-3.5.0 | h7b32b05_1 3.0 MB conda-forge 2025-05-07T20:24:22.3797301Z pycparser-2.22 | pyh29332c3_1 108 KB conda-forge 2025-05-07T20:24:22.3797729Z pyopenssl-25.0.0 | pyhd8ed1ab_0 120 KB conda-forge 2025-05-07T20:24:22.3798157Z python_abi-3.11 | 2_cp311 5 KB conda-forge 2025-05-07T20:24:22.3798620Z typing-extensions-4.13.2 | h0e9735f_0 88 KB conda-forge 2025-05-07T20:24:22.3799104Z typing_extensions-4.13.2 | pyh29332c3_0 51 KB conda-forge 2025-05-07T20:24:22.3799526Z ------------------------------------------------------------ 2025-05-07T20:24:22.3799872Z Total: 6.4 MB 2025-05-07T20:24:22.3800076Z 2025-05-07T20:24:22.3800206Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:22.3800760Z 2025-05-07T20:24:22.3800950Z cffi conda-forge/linux-64::cffi-1.17.1-py311hf29c0ef_0 2025-05-07T20:24:22.3801439Z cryptography conda-forge/linux-64::cryptography-44.0.3-py311hafd3f86_0 2025-05-07T20:24:22.3801933Z libgcc conda-forge/linux-64::libgcc-15.1.0-h767d61c_2 2025-05-07T20:24:22.3802374Z pycparser conda-forge/noarch::pycparser-2.22-pyh29332c3_1 2025-05-07T20:24:22.3802832Z pyopenssl conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0 2025-05-07T20:24:22.3803451Z python_abi conda-forge/linux-64::python_abi-3.11-2_cp311 2025-05-07T20:24:22.3804111Z typing-extensions conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0 2025-05-07T20:24:22.3804682Z typing_extensions conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0 2025-05-07T20:24:22.3805012Z 2025-05-07T20:24:22.3805126Z The following packages will be UPDATED: 2025-05-07T20:24:22.3805329Z 2025-05-07T20:24:22.3805923Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0 2025-05-07T20:24:22.3806689Z libgcc-ng pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2 2025-05-07T20:24:22.3807322Z libgomp pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2 2025-05-07T20:24:22.3808027Z openssl pkgs/main::openssl-3.0.16-h5eee18b_0 --> 
conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:24:22.3808542Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:22.4699174Z libgcc-15.1.0        | 810 KB    | ########## | 100%
2025-05-07T20:24:22.4828529Z cryptography-44.0.3  | 1.5 MB    | ########## | 100%
2025-05-07T20:24:22.5033186Z pyopenssl-25.0.0     | 120 KB    | ########## | 100%
2025-05-07T20:24:22.5091434Z libgomp-15.1.0       | 442 KB    | ########## | 100%
2025-05-07T20:24:22.5337260Z cffi-1.17.1          | 295 KB    | ########## | 100%
2025-05-07T20:24:22.5355361Z pycparser-2.22       | 108 KB    | ########## | 100%
2025-05-07T20:24:22.5874236Z typing_extensions-4. | 51 KB     | ########## | 100%
2025-05-07T20:24:22.5956217Z python_abi-3.11      | 5 KB      | ########## | 100%
2025-05-07T20:24:22.6402248Z libgcc-ng-15.1.0     | 34 KB     | ########## | 100%
2025-05-07T20:24:22.6402667Z openssl-3.5.0        | 3.0 MB    | ########## | 100%
2025-05-07T20:24:22.7108612Z typing-extensions-4. | 88 KB     | ########## | 100%
2025-05-07T20:24:22.8589915Z done
2025-05-07T20:24:22.9593394Z Preparing transaction: done
2025-05-07T20:24:23.0596848Z Verifying transaction: done
2025-05-07T20:24:24.5624681Z Executing transaction: done
2025-05-07T20:24:24.7378003Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:24:26.4590964Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:24:26.4602887Z [SETUP] Installing libxcrypt ...
2025-05-07T20:24:26.4626676Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:24:27.3256676Z Channels:
2025-05-07T20:24:27.3256920Z  - conda-forge
2025-05-07T20:24:27.3257156Z Platform: linux-64
2025-05-07T20:24:30.6406405Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:31.0169683Z Solving environment: done
2025-05-07T20:24:31.0782187Z ## Package Plan ##
2025-05-07T20:24:31.0782555Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:31.0782960Z   added / updated specs:
2025-05-07T20:24:31.0783207Z     - libxcrypt
2025-05-07T20:24:31.0783477Z The following packages will be downloaded:
2025-05-07T20:24:31.0783811Z     package                    |            build
2025-05-07T20:24:31.0784123Z     ---------------------------|-----------------
2025-05-07T20:24:31.0784497Z     libxcrypt-4.4.36           |       hd590300_1          98 KB  conda-forge
2025-05-07T20:24:31.0784891Z     ------------------------------------------------------------
2025-05-07T20:24:31.0785226Z                                            Total:          98 KB
2025-05-07T20:24:31.0785568Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:31.0786003Z   libxcrypt          conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:24:31.0786430Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:31.2196838Z libxcrypt-4.4.36     | 98 KB     | ########## | 100%
2025-05-07T20:24:31.2199670Z done
2025-05-07T20:24:31.3204995Z Preparing transaction: done
2025-05-07T20:24:31.4210069Z Verifying transaction: done
2025-05-07T20:24:31.5215262Z Executing transaction: done
2025-05-07T20:24:34.9374997Z [SETUP] Copying over ...
2025-05-07T20:24:34.9375737Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.11/crypt.h
2025-05-07T20:24:36.5797646Z [SETUP] Installed Python version: Python 3.11.11
2025-05-07T20:24:36.5798129Z [SETUP] Successfully created Conda environment: build_binary
2025-05-07T20:24:36.5832080Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:36.5832554Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:36.5845737Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:24:36.5846087Z env:
2025-05-07T20:24:36.5846317Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:24:36.5846627Z   BUILD_ENV: build_binary
2025-05-07T20:24:36.5846874Z   BUILD_TARGET: genai
2025-05-07T20:24:36.5847109Z   BUILD_VARIANT: cuda
2025-05-07T20:24:36.5847349Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:24:36.5847707Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:24:36.5848001Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:24:36.5848331Z ##[endgroup]
2025-05-07T20:24:36.9204585Z ################################################################################
2025-05-07T20:24:36.9204953Z # Install C/C++ Compilers
2025-05-07T20:24:36.9205217Z #
2025-05-07T20:24:36.9220065Z # [2025-05-07T20:24:36.921Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:24:36.9220474Z ################################################################################
2025-05-07T20:24:36.9235038Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:24:37.0120379Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:24:37.0131133Z [INSTALL] Installing GLIBC (architecture = 64) ...
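The crypt.h copy above is a workaround: the conda Python 3.11 packages no longer ship <crypt.h> alongside the interpreter headers, so source builds that include it through the Python include path would fail; libxcrypt supplies the header, which is then copied next to the Python headers. As a standalone sketch (the motivation is inferred; the commands themselves are from the log):

    conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
    PREFIX="$HOME/miniconda/envs/build_binary"
    cp "$PREFIX/include/crypt.h" "$PREFIX/include/python3.11/crypt.h"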
2025-05-07T20:24:37.0152745Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:24:37.8796838Z Channels:
2025-05-07T20:24:37.8797096Z  - conda-forge
2025-05-07T20:24:37.8797327Z Platform: linux-64
2025-05-07T20:24:41.1992619Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:41.5656558Z Solving environment: done
2025-05-07T20:24:41.6278223Z ## Package Plan ##
2025-05-07T20:24:41.6279126Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:41.6279552Z   added / updated specs:
2025-05-07T20:24:41.6279815Z     - sysroot_linux-64=2.17
2025-05-07T20:24:41.6280105Z The following packages will be downloaded:
2025-05-07T20:24:41.6280438Z     package                       |            build
2025-05-07T20:24:41.6280760Z     ------------------------------|-----------------
2025-05-07T20:24:41.6281223Z     kernel-headers_linux-64-3.10.0|      he073ed8_18         921 KB  conda-forge
2025-05-07T20:24:41.6281761Z     sysroot_linux-64-2.17         |      h0157908_18        14.5 MB  conda-forge
2025-05-07T20:24:41.6282180Z     ------------------------------------------------------------
2025-05-07T20:24:41.6282526Z                                               Total:        15.4 MB
2025-05-07T20:24:41.6282864Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:41.6283387Z   kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:24:41.6283953Z   sysroot_linux-64   conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:24:41.6284418Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:41.9919321Z kernel-headers_linux | 921 KB    | ########## | 100%
2025-05-07T20:24:41.9922175Z sysroot_linux-64-2.1 | 14.5 MB   | ########## | 100%
2025-05-07T20:24:42.6583248Z done
2025-05-07T20:24:42.7589728Z Preparing transaction: done
2025-05-07T20:24:42.9597159Z Verifying transaction: done
2025-05-07T20:24:43.1650395Z Executing transaction: done
2025-05-07T20:24:43.3188228Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:24:43.3188529Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:24:44.9917509Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
2025-05-07T20:24:44.9930455Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
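The [CHECK] lines above verify that the environment ships its own libstdc++.so.6, so binaries compiled with the conda toolchain resolve a matching C++ runtime at load time. One way to inspect that by hand, assuming the env prefix from this log (strings comes from binutils and may need installing separately):

    PREFIX="$HOME/miniconda/envs/build_binary"
    ls -l "$PREFIX/lib/libstdc++.so.6"                  # expect a symlink into the conda-forge libstdcxx package
    strings "$PREFIX/lib/libstdc++.so.6" | grep GLIBCXX_3.4 | tail -n 3   # newest GLIBCXX symbol versions provided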
2025-05-07T20:24:44.9953083Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0 2025-05-07T20:24:45.8858859Z Channels: 2025-05-07T20:24:45.8859104Z - conda-forge 2025-05-07T20:24:45.8859334Z Platform: linux-64 2025-05-07T20:24:49.3022460Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:24:50.2592621Z Solving environment: \ | / done 2025-05-07T20:24:50.3240457Z 2025-05-07T20:24:50.3240637Z ## Package Plan ## 2025-05-07T20:24:50.3240791Z 2025-05-07T20:24:50.3241079Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:50.3241487Z 2025-05-07T20:24:50.3241590Z added / updated specs: 2025-05-07T20:24:50.3241855Z - gxx_linux-64=11.4.0 2025-05-07T20:24:50.3242019Z 2025-05-07T20:24:50.3242032Z 2025-05-07T20:24:50.3242157Z The following packages will be downloaded: 2025-05-07T20:24:50.3242403Z 2025-05-07T20:24:50.3242541Z package | build 2025-05-07T20:24:50.3242862Z ---------------------------|----------------- 2025-05-07T20:24:50.3243263Z binutils_impl_linux-64-2.40| ha1999f0_7 6.0 MB conda-forge 2025-05-07T20:24:50.3243746Z binutils_linux-64-2.40 | hb3c18ed_4 28 KB conda-forge 2025-05-07T20:24:50.3244202Z gcc_impl_linux-64-11.4.0 | h00c12a0_13 53.0 MB conda-forge 2025-05-07T20:24:50.3244646Z gcc_linux-64-11.4.0 | ha077dfb_4 31 KB conda-forge 2025-05-07T20:24:50.3245089Z gxx_impl_linux-64-11.4.0 | h634f3ee_13 11.2 MB conda-forge 2025-05-07T20:24:50.3245528Z gxx_linux-64-11.4.0 | h35bfe5d_4 29 KB conda-forge 2025-05-07T20:24:50.3245955Z ld_impl_linux-64-2.40 | hf3520f5_7 691 KB conda-forge 2025-05-07T20:24:50.3246427Z libgcc-devel_linux-64-11.4.0| h8f596e0_113 2.3 MB conda-forge 2025-05-07T20:24:50.3246910Z libsanitizer-11.4.0 | h5763a12_13 3.5 MB conda-forge 2025-05-07T20:24:50.3247352Z libstdcxx-15.1.0 | h8f9b012_2 3.7 MB conda-forge 2025-05-07T20:24:50.3247926Z libstdcxx-devel_linux-64-11.4.0| h8f596e0_113 11.1 MB conda-forge 2025-05-07T20:24:50.3248414Z libstdcxx-ng-15.1.0 | h4852527_2 34 KB conda-forge 2025-05-07T20:24:50.3248825Z ------------------------------------------------------------ 2025-05-07T20:24:50.3249164Z Total: 91.6 MB 2025-05-07T20:24:50.3249387Z 2025-05-07T20:24:50.3249515Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:50.3249742Z 2025-05-07T20:24:50.3250017Z binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7 2025-05-07T20:24:50.3250574Z binutils_linux-64 conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4 2025-05-07T20:24:50.3251475Z gcc_impl_linux-64 conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13 2025-05-07T20:24:50.3252149Z gcc_linux-64 conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4 2025-05-07T20:24:50.3252652Z gxx_impl_linux-64 conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13 2025-05-07T20:24:50.3253154Z gxx_linux-64 conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4 2025-05-07T20:24:50.3253675Z libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113 2025-05-07T20:24:50.3254232Z libsanitizer conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13 2025-05-07T20:24:50.3254725Z libstdcxx conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2 2025-05-07T20:24:50.3255263Z libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113 2025-05-07T20:24:50.3255630Z 2025-05-07T20:24:50.3255745Z The following packages will be UPDATED: 2025-05-07T20:24:50.3255955Z 2025-05-07T20:24:50.3256272Z ld_impl_linux-64 pkgs/main::ld_impl_linux-64-2.40-h12e~ --> 
conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
2025-05-07T20:24:50.3256994Z   libstdcxx-ng       pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2
2025-05-07T20:24:50.3257557Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:52.8916655Z Preparing transaction: \ done
2025-05-07T20:24:53.1925411Z Verifying transaction: / - \ done
2025-05-07T20:24:53.2935335Z Executing transaction: / done
2025-05-07T20:24:53.4579202Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:24:57.3527595Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:57.3558013Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:24:57.3588505Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:24:57.3618941Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:24:59.2503873Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:59.3123096Z [CHECK] Binary cc found in PATH
2025-05-07T20:25:01.1852487Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:01.2468187Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:03.1209111Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:03.1841424Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:05.0631602Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:05.1258782Z [CHECK] Binary g++ found in PATH
2025-05-07T20:25:05.1263086Z [INFO] Printing out all preprocessor defines in the C compiler ...
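[NOTE] The command below preprocesses an empty translation unit read from stdin: -E stops after preprocessing, and -dM dumps every macro defined at that point, which is what produces the listing that follows. A minimal sketch for spot-checking a single define from this toolchain (the binary path is taken from the [CHECK] lines above; the grep target is just an illustrative choice):

    /home/ec2-user/miniconda/envs/build_binary/bin/cc -dM -E - < /dev/null | grep '__GNUC__'
    # prints: #define __GNUC__ 11    (consistent with the gxx_linux-64=11.4.0 install above)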
2025-05-07T20:25:05.1263747Z + conda run -n build_binary cc -dM -E - 2025-05-07T20:25:05.1263989Z 2025-05-07T20:25:07.0066323Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:07.0066897Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:07.0067499Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:07.0068248Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:07.0068831Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:07.0069421Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:07.0069781Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:07.0070178Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:07.0070540Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:07.0071023Z #define __CHAR_BIT__ 8 2025-05-07T20:25:07.0071430Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:07.0071964Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:07.0072484Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:07.0072953Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:07.0073506Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:07.0074073Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.0074588Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:07.0075147Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:07.0075582Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:07.0076030Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:07.0076537Z #define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:07.0077080Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:07.0077485Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:07.0077851Z #define __GCC_IEC_559 2 2025-05-07T20:25:07.0078228Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:07.0078597Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:07.0078998Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:07.0079358Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:07.0079780Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.0080236Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:07.0080590Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:07.0080962Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:07.0081385Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:07.0081707Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:07.0082072Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:07.0082500Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:07.0082861Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:07.0083171Z #define __INT8_C(c) c 2025-05-07T20:25:07.0083574Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:07.0083962Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.0084341Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:07.0084817Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:07.0085269Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:07.0085605Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:07.0086033Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.0086408Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:07.0086795Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:07.0087322Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:07.0087977Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:07.0088377Z #define __linux 1 2025-05-07T20:25:07.0088739Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:07.0089114Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 
2025-05-07T20:25:07.0089574Z #define __unix 1 2025-05-07T20:25:07.0089924Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:07.0090294Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:07.0090672Z #define __WINT_MIN__ 0U 2025-05-07T20:25:07.0091071Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.0091412Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:07.0091787Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:07.0092206Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:07.0092542Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:07.0092906Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:07.0093371Z #define __INT64_C(c) c ## L 2025-05-07T20:25:07.0093950Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:07.0094404Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:07.0094833Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:07.0095450Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:07.0095994Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:07.0096431Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:07.0096781Z #define __DBL_DIG__ 15 2025-05-07T20:25:07.0097054Z #define __FLT32_DIG__ 6 2025-05-07T20:25:07.0097533Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:07.0097972Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:07.0098263Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:07.0098767Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:07.0099281Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:07.0099688Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:07.0100035Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:07.0100501Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:07.0101048Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:07.0101399Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:07.0101751Z #define __unix__ 1 2025-05-07T20:25:07.0102125Z #define __INT_WIDTH__ 32 2025-05-07T20:25:07.0102446Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:07.0102782Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:07.0103187Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:25:07.0103528Z #define __UINT16_C(c) c 2025-05-07T20:25:07.0103961Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:07.0104340Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:07.0104801Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:07.0105236Z #define __gnu_linux__ 1 2025-05-07T20:25:07.0105987Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:07.0106455Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.0106837Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.0107248Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:07.0107613Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:07.0107944Z #define __GNUC__ 11 2025-05-07T20:25:07.0108288Z #define __pie__ 2 2025-05-07T20:25:07.0108711Z #define __MMX__ 1 2025-05-07T20:25:07.0109030Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:07.0109313Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:07.0109588Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:07.0109860Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:07.0110210Z #define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:07.0110611Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.0110921Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:07.0111188Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:07.0111454Z #define 
__HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:07.0111746Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:07.0112018Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:07.0112282Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:07.0121645Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:07.0121976Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:07.0122266Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:07.0122641Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:07.0122903Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:07.0123172Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:07.0123452Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:07.0123713Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:07.0123976Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:07.0124294Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:07.0124654Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:07.0124926Z #define __SSE2_MATH__ 1 2025-05-07T20:25:07.0125169Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:07.0125471Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.0125768Z #define __amd64 1 2025-05-07T20:25:07.0125989Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:07.0126261Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:07.0126572Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:07.0126877Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:07.0127637Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:25:07.0128080Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:07.0128338Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:07.0128602Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:07.0128874Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:07.0129144Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:07.0129403Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:07.0129684Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:07.0129931Z #define __x86_64 1 2025-05-07T20:25:07.0130158Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:07.0130527Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:07.0130986Z #define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:07.0131432Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:07.0131900Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:07.0132295Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:07.0132558Z #define __LP64__ 1 2025-05-07T20:25:07.0132784Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.0133137Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:07.0133514Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:07.0133786Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:07.0134062Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:07.0134344Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:07.0134613Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:07.0134884Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:07.0135143Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:07.0135401Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:07.0135665Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:07.0135994Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:07.0136382Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:07.0136676Z #define __FLT_DIG__ 6 2025-05-07T20:25:07.0136916Z #define __NO_INLINE__ 1 2025-05-07T20:25:07.0137169Z #define 
__DEC_EVAL_METHOD__ 2 2025-05-07T20:25:07.0137487Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:07.0137835Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:07.0138097Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:07.0138359Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:07.0138617Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:07.0138880Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:07.0139135Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:07.0139431Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:07.0139719Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:07.0139984Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:07.0140286Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:07.0140619Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:07.0140889Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:07.0141145Z #define __FLT128_DIG__ 33 2025-05-07T20:25:07.0141398Z #define __INT32_C(c) c 2025-05-07T20:25:07.0141646Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:07.0141922Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:07.0142202Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:07.0142490Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:25:07.0142803Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:07.0143114Z #define unix 1 2025-05-07T20:25:07.0143349Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:07.0143656Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.0143962Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:07.0144277Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:07.0144599Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:07.0144853Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:07.0145119Z #define __ELF__ 1 2025-05-07T20:25:07.0145344Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:07.0145631Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:07.0146051Z #define __FLT_RADIX__ 2 2025-05-07T20:25:07.0146483Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:07.0146833Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:07.0147195Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:07.0147453Z #define __SSE_MATH__ 1 2025-05-07T20:25:07.0147679Z #define __k8 1 2025-05-07T20:25:07.0147979Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:07.0148354Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:07.0148645Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:07.0148943Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:07.0149205Z #define __LDBL_DIG__ 18 2025-05-07T20:25:07.0149444Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:07.0149705Z #define __x86_64__ 1 2025-05-07T20:25:07.0149943Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:07.0150241Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:07.0150570Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.0150882Z #define __FLT64_DIG__ 15 2025-05-07T20:25:07.0151173Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.0151515Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:07.0151830Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.0152100Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:07.0152371Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.0152673Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:07.0153042Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 
2025-05-07T20:25:07.0153436Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:07.0153735Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:07.0154073Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:07.0154407Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:07.0154706Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:07.0154995Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:07.0155314Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:07.0155601Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:07.0155850Z #define __SEG_FS 1 2025-05-07T20:25:07.0156090Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:07.0156366Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:07.0156654Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.0156950Z #define __SEG_GS 1 2025-05-07T20:25:07.0157263Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:25:07.0157646Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:07.0157931Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:07.0158213Z #define __INT16_TYPE__ short int 2025-05-07T20:25:07.0158497Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:07.0158798Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:07.0159067Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:07.0159314Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:07.0159582Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:07.0159930Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:07.0160320Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.0160615Z #define linux 1 2025-05-07T20:25:07.0160848Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.0161126Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:07.0161405Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:07.0161663Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:07.0161923Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:07.0162190Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:07.0162542Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:07.0162957Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:07.0163283Z #define __code_model_small__ 1 2025-05-07T20:25:07.0163569Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:07.0163863Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:07.0164112Z #define __k8__ 1 2025-05-07T20:25:07.0164351Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:07.0164763Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:07.0165142Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:07.0165390Z #define __pic__ 2 2025-05-07T20:25:07.0165648Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.0165955Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:07.0166259Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.0166598Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:07.0166961Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:07.0167329Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:07.0167759Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:07.0168059Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:07.0168366Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:07.0168627Z #define __linux__ 1 2025-05-07T20:25:07.0168862Z #define __INT64_TYPE__ long int 2025-05-07T20:25:07.0169122Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:07.0169383Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:07.0169664Z 
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:07.0169920Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:07.0170215Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.0170542Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:07.0170833Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:07.0171097Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:07.0171390Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:07.0171683Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:07.0172014Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:07.0172375Z #define __SSE__ 1 2025-05-07T20:25:07.0172601Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:07.0172931Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:07.0173272Z #define __amd64__ 1 2025-05-07T20:25:07.0173495Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:07.0173740Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:07.0174014Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:07.0174288Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:07.0174549Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:07.0174821Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:07.0175085Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:07.0175355Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:07.0175619Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:07.0175965Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:07.0176429Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:07.0176788Z #define _LP64 1 2025-05-07T20:25:07.0177004Z #define __UINT8_C(c) c 2025-05-07T20:25:07.0177238Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:07.0177506Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:07.0177776Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:07.0178046Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:07.0178340Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:07.0178702Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:07.0179173Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:07.0179545Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.0179844Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.0180161Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:07.0180522Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:07.0180893Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:07.0181163Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:07.0181500Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:07.0181856Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:07.0182116Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:07.0182363Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:07.0182606Z #define __FXSR__ 1 2025-05-07T20:25:07.0182905Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:07.0185085Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:07.0185591Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:07.0185893Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:07.0186153Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:07.0186478Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:07.0186833Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:07.0187078Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:07.0187317Z #define __PIC__ 2 2025-05-07T20:25:07.0187565Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:07.0187965Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:07.0188353Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:07.0188684Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:07.0189015Z #define __SSE2__ 1 2025-05-07T20:25:07.0189244Z #define __INT32_TYPE__ int 2025-05-07T20:25:07.0189495Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:07.0189763Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:07.0190098Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:07.0190454Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:07.0190728Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:07.0191001Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:07.0191271Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.0191542Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:07.0191791Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:07.0192039Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:07.0192323Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.0192618Z #define __PIE__ 2 2025-05-07T20:25:07.0192940Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:07.0193325Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:07.0193667Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:07.0194036Z #define __INT16_C(c) c 2025-05-07T20:25:07.0194263Z #define __STDC__ 1 2025-05-07T20:25:07.0194495Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:07.0194767Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:07.0195018Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:07.0195321Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:07.0195667Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:07.0196000Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:07.0196261Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.0196541Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:07.0196817Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:07.0197092Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:07.0197384Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.0197656Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:07.0197946Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.0198339Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:07.0198716Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:07.0199018Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:07.0199315Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:07.0199569Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:07.0199728Z 2025-05-07T20:25:07.0698778Z 2025-05-07T20:25:07.0699371Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
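[NOTE] For the C++ dump below, the extra -x c++ flag matters: stdin has no file extension, so the driver must be told to treat the input as C++ rather than C (which is why the C++-only __cpp_* feature-test macros appear in this second listing but not the first). A minimal sketch for confirming the default language standard (path again taken from the [CHECK] lines above):

    /home/ec2-user/miniconda/envs/build_binary/bin/c++ -dM -E -x c++ - < /dev/null | grep '__cplusplus'
    # prints: #define __cplusplus 201703L    (gcc 11 defaults to -std=gnu++17)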
2025-05-07T20:25:07.0699811Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:07.0700048Z 2025-05-07T20:25:08.9606024Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:08.9606464Z #define __cpp_attributes 200809L 2025-05-07T20:25:08.9606931Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:08.9607376Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:08.9607767Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:08.9608031Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:08.9608361Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:08.9611212Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:08.9611782Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:08.9612214Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:08.9612620Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:08.9612950Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:08.9613198Z #define __CHAR_BIT__ 8 2025-05-07T20:25:08.9613422Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:08.9613668Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:08.9613913Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:08.9614171Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:08.9614441Z #define __cpp_static_assert 201411L 2025-05-07T20:25:08.9614727Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:08.9615019Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.9615316Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:08.9615609Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:08.9615930Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:08.9616267Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:08.9616678Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:08.9617088Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:08.9617392Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:08.9617671Z #define __GCC_IEC_559 2 2025-05-07T20:25:08.9617920Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:08.9618194Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:08.9618472Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:08.9618764Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:08.9619056Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:08.9619384Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:08.9619692Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:08.9620015Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.9620339Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:08.9620614Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:08.9620897Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:08.9621176Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:08.9621472Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:08.9621738Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:08.9621993Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:08.9622266Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:08.9622594Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:08.9622917Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:08.9623170Z #define __INT8_C(c) c 2025-05-07T20:25:08.9623409Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:08.9623672Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:08.9623992Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.9624316Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:08.9624587Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:08.9624875Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:08.9625198Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:08.9625557Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:08.9625832Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:08.9626112Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:08.9626375Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:08.9626645Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:08.9626923Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:08.9627310Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:08.9627719Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:08.9628006Z #define __linux 1 2025-05-07T20:25:08.9628233Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:08.9628503Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:08.9628783Z #define __unix 1 2025-05-07T20:25:08.9629010Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:08.9629298Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:08.9629578Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:08.9629972Z #define __WINT_MIN__ 0U 2025-05-07T20:25:08.9630296Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:08.9630573Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:08.9630848Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:08.9631114Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:08.9631433Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:08.9631718Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:08.9632020Z #define __INT64_C(c) c ## L 2025-05-07T20:25:08.9632284Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:08.9632584Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:08.9632863Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:08.9633172Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:08.9633445Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:08.9633713Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:08.9634067Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:08.9634439Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:08.9634702Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:08.9634992Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:08.9635262Z #define __DBL_DIG__ 15 2025-05-07T20:25:08.9635516Z #define __FLT32_DIG__ 6 2025-05-07T20:25:08.9635817Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:08.9636165Z #define __GXX_WEAK__ 1 2025-05-07T20:25:08.9636404Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:08.9636657Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:08.9637037Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:08.9637387Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:08.9637655Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:08.9637954Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:08.9638285Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:08.9638697Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:08.9639097Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:08.9639387Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:08.9639652Z #define __unix__ 1 2025-05-07T20:25:08.9639873Z #define __INT_WIDTH__ 32 2025-05-07T20:25:08.9640123Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:08.9640371Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:08.9649397Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:08.9649723Z #define __UINT16_C(c) c 2025-05-07T20:25:08.9649987Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:08.9650263Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:08.9650633Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:08.9651016Z #define __gnu_linux__ 1 2025-05-07T20:25:08.9651277Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:08.9651548Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:08.9651845Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:08.9652145Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:08.9652420Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:08.9652696Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:08.9652963Z #define __GNUC__ 11 2025-05-07T20:25:08.9653193Z #define __GXX_RTTI 1 2025-05-07T20:25:08.9653425Z #define __pie__ 2 2025-05-07T20:25:08.9653638Z #define __MMX__ 1 2025-05-07T20:25:08.9653867Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:08.9654143Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:08.9654422Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:08.9654696Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:08.9654954Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:08.9655253Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:08.9655576Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:08.9655926Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:08.9656301Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:08.9656611Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.9656933Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:08.9657203Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:08.9657676Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:08.9658094Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:08.9658396Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:08.9658664Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:08.9658932Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:08.9659225Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:08.9659520Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:08.9659793Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:08.9660083Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:08.9660337Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:08.9660608Z #define __cplusplus 201703L 2025-05-07T20:25:08.9660883Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:08.9661163Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:08.9661421Z #define __DEPRECATED 1 2025-05-07T20:25:08.9661680Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:08.9661975Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:08.9662226Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:08.9662549Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:08.9662914Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:08.9663177Z #define __SSE2_MATH__ 1 2025-05-07T20:25:08.9663428Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:08.9663729Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.9664012Z #define __amd64 1 2025-05-07T20:25:08.9664237Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:08.9664505Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:08.9664763Z #define __GNUG__ 11 2025-05-07T20:25:08.9665017Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:08.9665332Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:08.9665580Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:08.9665837Z #define __FLT64X_MIN_EXP__ (-16381) 
2025-05-07T20:25:08.9671749Z #define __LP64__ 1
2025-05-07T20:25:08.9677239Z #define __VERSION__ "11.4.0"
2025-05-07T20:25:08.9689804Z #define __x86_64__ 1
2025-05-07T20:25:08.9711531Z #define __linux__ 1
2025-05-07T20:25:08.9727700Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__
2025-05-07T20:25:08.9747062Z [... several hundred further predefined macros elided; together they confirm GCC 11.4.0 targeting x86-64 Linux (ELF, little-endian, LP64), consistent with the default language standards probed below ...]
2025-05-07T20:25:09.0238120Z + conda run -n build_binary c++ --version
2025-05-07T20:25:10.8977669Z c++ (conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:25:10.8978052Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:25:10.8978499Z This is free software; see the source for copying conditions. There is NO
2025-05-07T20:25:10.8979042Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:25:10.9623666Z [INFO] Printing the default version of the C standard used by the compiler ...
2025-05-07T20:25:10.9624205Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__
2025-05-07T20:25:12.9069149Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:12.9071923Z [INFO] Printing the default version of the C++ standard used by the compiler ...
2025-05-07T20:25:12.9073828Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus
2025-05-07T20:25:14.8470694Z #define __cplusplus 201703L
2025-05-07T20:25:14.8475227Z [INSTALL] Successfully installed C/C++ compilers
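[NOTE] The two standard-version probes above generalize to any toolchain: dumping the preprocessor's predefined macros and filtering for the language-standard macros is a quick way to learn which C/C++ standard a compiler targets by default. A minimal sketch, assuming only the conda env "build_binary" with cc and c++ installed, as in this job:

    # Dump every predefined macro (the long listing printed earlier in this step):
    conda run -n build_binary c++ -dM -E -x c++ - < /dev/null

    # Keep just the language-standard macros:
    conda run -n build_binary cc  -dM -E -        < /dev/null | grep __STDC_VERSION__   # 201710L -> C17
    conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus        # 201703L -> C++17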
2025-05-07T20:25:14.8520661Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.6.3
2025-05-07T20:25:14.8521085Z . $PRELUDE; install_cuda $BUILD_ENV 12.6.3
2025-05-07T20:25:14.8532626Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:14.8532974Z env:
2025-05-07T20:25:14.8533201Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:14.8533497Z   BUILD_ENV: build_binary
2025-05-07T20:25:14.8533744Z   BUILD_TARGET: genai
2025-05-07T20:25:14.8533973Z   BUILD_VARIANT: cuda
2025-05-07T20:25:14.8534200Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:25:14.8534458Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:14.8534758Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:14.8535080Z ##[endgroup]
2025-05-07T20:25:15.1867272Z ################################################################################
2025-05-07T20:25:15.1867650Z # Install CUDA
2025-05-07T20:25:15.1867864Z #
2025-05-07T20:25:15.1883337Z # [2025-05-07T20:25:15.188Z] + install_cuda build_binary 12.6.3
2025-05-07T20:25:15.1883721Z ################################################################################
2025-05-07T20:25:15.1898736Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:15.2818769Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:15.2819646Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:25:15.2823710Z + conda clean --packages --tarball -y
2025-05-07T20:25:15.9898505Z Will remove 32 (148.9 MB) tarball(s).
2025-05-07T20:25:15.9898875Z Will remove 6 (619 KB) package(s).
2025-05-07T20:25:16.0533732Z + conda clean --all -y
2025-05-07T20:25:16.7201784Z There are no unused tarball(s) to remove.
2025-05-07T20:25:16.7202179Z Will remove 1 index cache(s).
2025-05-07T20:25:16.7202475Z There are no unused package(s) to remove.
2025-05-07T20:25:16.7202793Z There are no tempfile(s) to remove.
2025-05-07T20:25:16.7203096Z There are no logfile(s) to remove.
2025-05-07T20:25:16.7844856Z [INSTALL] Installing CUDA 12.6.3 ...
2025-05-07T20:25:16.7868119Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.6.3
2025-05-07T20:25:17.7204886Z Channels:
2025-05-07T20:25:17.7205135Z  - conda-forge
2025-05-07T20:25:17.7205368Z Platform: linux-64
2025-05-07T20:25:28.2989504Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:29.4088329Z Solving environment: done
2025-05-07T20:25:29.4833400Z ## Package Plan ##
2025-05-07T20:25:29.4833812Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:29.4834253Z   added / updated specs:
2025-05-07T20:25:29.4834502Z     - cuda=12.6.3
2025-05-07T20:25:29.4834778Z The following packages will be downloaded:
2025-05-07T20:25:29.4835119Z     package                        |            build
2025-05-07T20:25:29.4835435Z     -------------------------------|-----------------
2025-05-07T20:25:29.4837444Z     cuda-12.6.3                    |        ha804496_0          26 KB  conda-forge
2025-05-07T20:25:29.4848136Z     cuda-nsight-12.6.77            |        h7938cbb_0       113.2 MB  conda-forge
2025-05-07T20:25:29.4849033Z     cuda-nvcc-dev_linux-64-12.6.85 |        he91c749_0        10.8 MB  conda-forge
2025-05-07T20:25:29.4849958Z     cuda-nvcc-tools-12.6.85        |        he02047a_0        23.0 MB  conda-forge
2025-05-07T20:25:29.4850881Z     cuda-nvdisasm-12.6.77          |        hbd13f7d_1        47.6 MB  conda-forge
2025-05-07T20:25:29.4852678Z     cuda-nvrtc-12.6.85             |        hbd13f7d_0        17.3 MB  conda-forge
2025-05-07T20:25:29.4854492Z     cuda-nvvm-impl-12.6.85         |        he02047a_0         7.7 MB  conda-forge
2025-05-07T20:25:29.4854950Z     cuda-nvvm-tools-12.6.85        |        he02047a_0        10.4 MB  conda-forge
2025-05-07T20:25:29.4855394Z     cuda-nvvp-12.6.80              |        hbd13f7d_1       109.3 MB  conda-forge
2025-05-07T20:25:29.4858110Z     cuda-sanitizer-api-12.6.77     |        hbd13f7d_1         8.9 MB  conda-forge
2025-05-07T20:25:29.4865733Z     gds-tools-1.11.1.6             |        h5888daf_4        37.8 MB  conda-forge
2025-05-07T20:25:29.4868118Z     libcublas-12.6.4.1             |        h5888daf_1       256.2 MB  conda-forge
2025-05-07T20:25:29.4868999Z     libcufft-11.3.0.4              |        hbd13f7d_0       156.2 MB  conda-forge
2025-05-07T20:25:29.4870776Z     libcurand-10.3.7.77            |        hbd13f7d_0        39.9 MB  conda-forge
2025-05-07T20:25:29.4871685Z     libcusolver-11.7.1.2           |        h5888daf_1        95.8 MB  conda-forge
2025-05-07T20:25:29.4872604Z     libcusparse-12.5.4.2           |        hbd13f7d_0       118.6 MB  conda-forge
2025-05-07T20:25:29.4877538Z     libnpp-12.3.1.54               |        h5888daf_0        93.4 MB  conda-forge
2025-05-07T20:25:29.4880137Z     libnvjitlink-12.6.85           |        hbd13f7d_0        14.9 MB  conda-forge
2025-05-07T20:25:29.4886653Z     nsight-compute-2024.3.2.3      |        hb5ebaad_0       443.1 MB  conda-forge
2025-05-07T20:25:29.4889711Z     python-3.11.8                  |hab00c5b_0_cpython        29.3 MB  conda-forge
2025-05-07T20:25:29.4901461Z     [... ~90 smaller packages elided: remaining CUDA 12.6 components and their -dev counterparts, the conda-forge gcc/gxx/binutils 11.4.0 toolchain, and X11/font/system support libraries ...]
2025-05-07T20:25:29.4901842Z     ------------------------------------------------------------
2025-05-07T20:25:29.4902193Z                                            Total:        1.64 GB
2025-05-07T20:25:29.4902541Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:29.4902973Z   [... ~115 entries elided; they mirror the download list above: the cuda/cuda-toolkit/cuda-runtime/cuda-compiler meta-packages and their 12.6 components, the CUDA math libraries with -dev packages, the gcc/gxx 11.4.0 compilers, and supporting system libraries, all from conda-forge ...]
2025-05-07T20:25:29.4987500Z The following packages will be UPDATED:
2025-05-07T20:25:29.4987984Z   libuuid            pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0
2025-05-07T20:25:29.4988569Z   zlib               pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2
2025-05-07T20:25:29.4989116Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:25:29.4989715Z   python             pkgs/main::python-3.11.11-he870216_0 --> conda-forge::python-3.11.8-hab00c5b_0_cpython
2025-05-07T20:25:29.4990341Z   sqlite             pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1
2025-05-07T20:25:29.4990915Z   tk                 pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101
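[NOTE] The "[EXEC] [ATTEMPT 0/3]" prefix on the wget and conda install commands above shows that the prelude script retries network-bound steps. The workflow's real implementation lives in .github/scripts/setup_env.bash and is not shown in this log; the following is only a minimal bash sketch of such a wrapper, with a hypothetical helper name:

    # Hypothetical retry helper; mirrors the "[EXEC] [ATTEMPT n/3]" log prefix.
    exec_with_retries () {
      local max=3 attempt
      for attempt in 0 1 2; do
        echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
        "$@" && return 0          # stop on first success
        sleep $((2 ** attempt))   # back off 1s, 2s, 4s between attempts
      done
      return 1                    # all attempts failed
    }

    # Example: the CUDA install command from this step.
    exec_with_retries conda install --force-reinstall -n build_binary \
        -c conda-forge --override-channels -y cuda=12.6.3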
2025-05-07T20:25:29.4991417Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:29.4991793Z [... interleaved download progress bars elided: nsight-compute (443.1 MB), libcublas (256.2 MB), libcufft (156.2 MB), libcusparse (118.6 MB), cuda-nsight (113.2 MB), cuda-nvvp (109.3 MB), libcusolver (95.8 MB), libnpp (93.4 MB), and the remaining packages advance in parallel from 0%; the excerpt ends mid-download at 2025-05-07T20:25:32Z ...]
| 443.1 MB | ## | 20% 2025-05-07T20:25:32.1673391Z 2025-05-07T20:25:32.1673396Z 2025-05-07T20:25:32.1673399Z 2025-05-07T20:25:32.1675763Z 2025-05-07T20:25:32.2138614Z cuda-nsight-12.6.77 | 113.2 MB | ######## | 81%  2025-05-07T20:25:32.2139723Z 2025-05-07T20:25:32.2482985Z libcublas-12.6.4.1 | 256.2 MB | ###5 | 36%  2025-05-07T20:25:32.2483280Z 2025-05-07T20:25:32.2485503Z 2025-05-07T20:25:32.2638489Z libcufft-11.3.0.4 | 156.2 MB | #####2 | 52%  2025-05-07T20:25:32.2638760Z 2025-05-07T20:25:32.2638764Z 2025-05-07T20:25:32.2639073Z 2025-05-07T20:25:32.2673682Z libcusparse-12.5.4.2 | 118.6 MB | ########5 | 85%  2025-05-07T20:25:32.2674386Z 2025-05-07T20:25:32.2674422Z 2025-05-07T20:25:32.2674426Z 2025-05-07T20:25:32.2674430Z 2025-05-07T20:25:32.2709420Z cuda-nsight-12.6.77 | 113.2 MB | ########3 | 84%  2025-05-07T20:25:32.3140608Z nsight-compute-2024. | 443.1 MB | ##1 | 21% 2025-05-07T20:25:32.3140908Z 2025-05-07T20:25:32.3592999Z libcublas-12.6.4.1 | 256.2 MB | ###7 | 37%  2025-05-07T20:25:32.3593366Z 2025-05-07T20:25:32.3597155Z 2025-05-07T20:25:32.3638402Z libcufft-11.3.0.4 | 156.2 MB | #####4 | 54%  2025-05-07T20:25:32.3638679Z 2025-05-07T20:25:32.3638683Z 2025-05-07T20:25:32.3638687Z 2025-05-07T20:25:32.3676714Z libcusparse-12.5.4.2 | 118.6 MB | ########8 | 89%  2025-05-07T20:25:32.3677138Z 2025-05-07T20:25:32.3677144Z 2025-05-07T20:25:32.3677149Z 2025-05-07T20:25:32.3680397Z 2025-05-07T20:25:32.3761802Z cuda-nsight-12.6.77 | 113.2 MB | ########7 | 87%  2025-05-07T20:25:32.4160831Z nsight-compute-2024. | 443.1 MB | ##1 | 22% 2025-05-07T20:25:32.4162007Z 2025-05-07T20:25:32.4593684Z libcublas-12.6.4.1 | 256.2 MB | ###8 | 39%  2025-05-07T20:25:32.4594151Z 2025-05-07T20:25:32.4596676Z 2025-05-07T20:25:32.4696519Z libcufft-11.3.0.4 | 156.2 MB | #####6 | 56%  2025-05-07T20:25:32.4696927Z 2025-05-07T20:25:32.4696933Z 2025-05-07T20:25:32.4699648Z 2025-05-07T20:25:32.4702603Z libcusparse-12.5.4.2 | 118.6 MB | #########2 | 92%  2025-05-07T20:25:32.4702977Z 2025-05-07T20:25:32.4702982Z 2025-05-07T20:25:32.4702986Z 2025-05-07T20:25:32.4702989Z 2025-05-07T20:25:32.4780832Z cuda-nsight-12.6.77 | 113.2 MB | ######### | 90%  2025-05-07T20:25:32.5167478Z nsight-compute-2024. | 443.1 MB | ##2 | 23% 2025-05-07T20:25:32.5168679Z 2025-05-07T20:25:32.5703086Z libcublas-12.6.4.1 | 256.2 MB | #### | 40%  2025-05-07T20:25:32.5703360Z 2025-05-07T20:25:32.5703364Z 2025-05-07T20:25:32.5703831Z 2025-05-07T20:25:32.5706115Z libcusparse-12.5.4.2 | 118.6 MB | #########5 | 95%  2025-05-07T20:25:32.5706538Z 2025-05-07T20:25:32.5706544Z 2025-05-07T20:25:32.5706550Z 2025-05-07T20:25:32.5707369Z 2025-05-07T20:25:32.5782683Z cuda-nsight-12.6.77 | 113.2 MB | #########3 | 94%  2025-05-07T20:25:32.5975010Z nsight-compute-2024. | 443.1 MB | ##3 | 23% 2025-05-07T20:25:32.5975282Z 2025-05-07T20:25:32.5977782Z 2025-05-07T20:25:32.6185977Z libcufft-11.3.0.4 | 156.2 MB | #####8 | 58%  2025-05-07T20:25:32.6187186Z 2025-05-07T20:25:32.6708174Z libcublas-12.6.4.1 | 256.2 MB | ####1 | 42%  2025-05-07T20:25:32.6708509Z 2025-05-07T20:25:32.6708513Z 2025-05-07T20:25:32.6708517Z 2025-05-07T20:25:32.6709506Z libcusparse-12.5.4.2 | 118.6 MB | #########8 | 99%  2025-05-07T20:25:32.6709873Z 2025-05-07T20:25:32.6709879Z 2025-05-07T20:25:32.6709882Z 2025-05-07T20:25:32.6712254Z 2025-05-07T20:25:32.6864121Z cuda-nsight-12.6.77 | 113.2 MB | #########7 | 97%  2025-05-07T20:25:32.7186765Z nsight-compute-2024. 
| 443.1 MB | ##4 | 24% 2025-05-07T20:25:32.7187508Z 2025-05-07T20:25:32.7763261Z libcublas-12.6.4.1 | 256.2 MB | ####3 | 43%  2025-05-07T20:25:32.7763639Z 2025-05-07T20:25:32.7763643Z 2025-05-07T20:25:32.8187916Z libcufft-11.3.0.4 | 156.2 MB | ###### | 60%  2025-05-07T20:25:32.8188482Z 2025-05-07T20:25:32.8534738Z libcublas-12.6.4.1 | 256.2 MB | ####6 | 47%  2025-05-07T20:25:32.8764098Z nsight-compute-2024. | 443.1 MB | ##4 | 25% 2025-05-07T20:25:32.8764496Z 2025-05-07T20:25:32.8765355Z 2025-05-07T20:25:32.9499645Z libcufft-11.3.0.4 | 156.2 MB | ######2 | 62%  2025-05-07T20:25:32.9500038Z 2025-05-07T20:25:32.9535233Z libcublas-12.6.4.1 | 256.2 MB | ####8 | 49%  2025-05-07T20:25:32.9770321Z nsight-compute-2024. | 443.1 MB | ##5 | 26% 2025-05-07T20:25:32.9770644Z 2025-05-07T20:25:32.9772772Z 2025-05-07T20:25:33.0538367Z libcufft-11.3.0.4 | 156.2 MB | ######4 | 65%  2025-05-07T20:25:33.0771581Z nsight-compute-2024. | 443.1 MB | ##6 | 27% 2025-05-07T20:25:33.0772003Z 2025-05-07T20:25:33.0772023Z 2025-05-07T20:25:33.0826244Z libcufft-11.3.0.4 | 156.2 MB | ######7 | 67%  2025-05-07T20:25:33.0829408Z 2025-05-07T20:25:33.1541992Z libcublas-12.6.4.1 | 256.2 MB | ##### | 50%  2025-05-07T20:25:33.1896758Z nsight-compute-2024. | 443.1 MB | ##7 | 27% 2025-05-07T20:25:33.1897123Z 2025-05-07T20:25:33.1900385Z 2025-05-07T20:25:33.1990253Z libcufft-11.3.0.4 | 156.2 MB | ######9 | 70%  2025-05-07T20:25:33.1991079Z 2025-05-07T20:25:33.2542792Z libcublas-12.6.4.1 | 256.2 MB | #####2 | 52%  2025-05-07T20:25:33.2991578Z nsight-compute-2024. | 443.1 MB | ##8 | 29% 2025-05-07T20:25:33.2991979Z 2025-05-07T20:25:33.3545234Z libcublas-12.6.4.1 | 256.2 MB | #####4 | 55%  2025-05-07T20:25:33.3992382Z nsight-compute-2024. | 443.1 MB | ##9 | 30% 2025-05-07T20:25:33.3992777Z 2025-05-07T20:25:33.4003336Z libcublas-12.6.4.1 | 256.2 MB | #####7 | 57%  2025-05-07T20:25:33.4003662Z 2025-05-07T20:25:33.4005448Z 2025-05-07T20:25:33.4546168Z libcufft-11.3.0.4 | 156.2 MB | #######1 | 72%  2025-05-07T20:25:33.5005381Z nsight-compute-2024. | 443.1 MB | ###1 | 31% 2025-05-07T20:25:33.5005833Z 2025-05-07T20:25:33.5005839Z 2025-05-07T20:25:33.5110816Z libcufft-11.3.0.4 | 156.2 MB | #######4 | 75%  2025-05-07T20:25:33.5111710Z 2025-05-07T20:25:33.5546435Z libcublas-12.6.4.1 | 256.2 MB | #####9 | 59%  2025-05-07T20:25:33.6045754Z nsight-compute-2024. | 443.1 MB | ###2 | 32% 2025-05-07T20:25:33.6046026Z 2025-05-07T20:25:33.6046031Z 2025-05-07T20:25:33.6112120Z libcufft-11.3.0.4 | 156.2 MB | #######6 | 77%  2025-05-07T20:25:33.6112394Z 2025-05-07T20:25:33.6548618Z libcublas-12.6.4.1 | 256.2 MB | ######1 | 61%  2025-05-07T20:25:33.7047150Z nsight-compute-2024. | 443.1 MB | ###3 | 34% 2025-05-07T20:25:33.7047433Z 2025-05-07T20:25:33.7047997Z 2025-05-07T20:25:33.7183591Z libcufft-11.3.0.4 | 156.2 MB | #######9 | 80%  2025-05-07T20:25:33.7185669Z 2025-05-07T20:25:33.7665810Z libcublas-12.6.4.1 | 256.2 MB | ######3 | 63%  2025-05-07T20:25:33.8049431Z nsight-compute-2024. | 443.1 MB | ###4 | 35% 2025-05-07T20:25:33.8049828Z 2025-05-07T20:25:33.8051356Z 2025-05-07T20:25:33.8184593Z libcufft-11.3.0.4 | 156.2 MB | ########2 | 82%  2025-05-07T20:25:33.8186052Z 2025-05-07T20:25:33.8667454Z libcublas-12.6.4.1 | 256.2 MB | ######5 | 65%  2025-05-07T20:25:33.9196765Z nsight-compute-2024. 
| 443.1 MB | ###5 | 36% 2025-05-07T20:25:33.9197505Z 2025-05-07T20:25:33.9290761Z libcublas-12.6.4.1 | 256.2 MB | ######7 | 67%  2025-05-07T20:25:33.9291031Z 2025-05-07T20:25:33.9291251Z 2025-05-07T20:25:33.9668141Z libcufft-11.3.0.4 | 156.2 MB | ########4 | 85%  2025-05-07T20:25:34.0254872Z nsight-compute-2024. | 443.1 MB | ###7 | 37% 2025-05-07T20:25:34.0255592Z 2025-05-07T20:25:34.0294500Z libcublas-12.6.4.1 | 256.2 MB | ######9 | 69%  2025-05-07T20:25:34.0294873Z 2025-05-07T20:25:34.0295669Z 2025-05-07T20:25:34.0691563Z libcufft-11.3.0.4 | 156.2 MB | ########7 | 88%  2025-05-07T20:25:34.1258352Z nsight-compute-2024. | 443.1 MB | ###8 | 38% 2025-05-07T20:25:34.1261038Z 2025-05-07T20:25:34.1418122Z libcublas-12.6.4.1 | 256.2 MB | #######1 | 71%  2025-05-07T20:25:34.1418430Z 2025-05-07T20:25:34.1418434Z 2025-05-07T20:25:34.1697753Z libcufft-11.3.0.4 | 156.2 MB | ######### | 90%  2025-05-07T20:25:34.2361938Z nsight-compute-2024. | 443.1 MB | ###9 | 39% 2025-05-07T20:25:34.2362317Z 2025-05-07T20:25:34.2418960Z libcublas-12.6.4.1 | 256.2 MB | #######3 | 73%  2025-05-07T20:25:34.2419350Z 2025-05-07T20:25:34.2419357Z 2025-05-07T20:25:34.2784731Z libcufft-11.3.0.4 | 156.2 MB | #########3 | 93%  2025-05-07T20:25:34.3367026Z nsight-compute-2024. | 443.1 MB | #### | 41% 2025-05-07T20:25:34.3367811Z 2025-05-07T20:25:34.3419797Z libcublas-12.6.4.1 | 256.2 MB | #######5 | 75%  2025-05-07T20:25:34.3420207Z 2025-05-07T20:25:34.3420213Z 2025-05-07T20:25:34.3853600Z libcufft-11.3.0.4 | 156.2 MB | #########5 | 96%  2025-05-07T20:25:34.4422640Z nsight-compute-2024. | 443.1 MB | ####1 | 42% 2025-05-07T20:25:34.4422927Z 2025-05-07T20:25:34.4422932Z 2025-05-07T20:25:34.4498081Z libcufft-11.3.0.4 | 156.2 MB | #########8 | 99%  2025-05-07T20:25:34.4498461Z 2025-05-07T20:25:34.4869214Z libcublas-12.6.4.1 | 256.2 MB | #######7 | 77%  2025-05-07T20:25:34.5499428Z nsight-compute-2024. | 443.1 MB | ####2 | 43% 2025-05-07T20:25:34.5503157Z 2025-05-07T20:25:34.5869308Z libcublas-12.6.4.1 | 256.2 MB | #######9 | 79%  2025-05-07T20:25:34.6351758Z nsight-compute-2024. | 443.1 MB | ####4 | 44% 2025-05-07T20:25:34.6352028Z 2025-05-07T20:25:34.6352033Z 2025-05-07T20:25:34.6352037Z 2025-05-07T20:25:34.6352041Z 2025-05-07T20:25:34.6500949Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:34.6503556Z 2025-05-07T20:25:34.6792881Z libcublas-12.6.4.1 | 256.2 MB | ########1 | 81%  2025-05-07T20:25:34.6793227Z 2025-05-07T20:25:34.6793233Z 2025-05-07T20:25:34.6793259Z 2025-05-07T20:25:34.6793265Z 2025-05-07T20:25:34.6794736Z 2025-05-07T20:25:34.6889358Z cuda-nvvp-12.6.80 | 109.3 MB | | 0%  2025-05-07T20:25:34.6889641Z 2025-05-07T20:25:34.6889645Z 2025-05-07T20:25:34.6889648Z 2025-05-07T20:25:34.7490499Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:25:34.7490810Z 2025-05-07T20:25:34.7490814Z 2025-05-07T20:25:34.7490818Z 2025-05-07T20:25:34.7490822Z 2025-05-07T20:25:34.7490826Z 2025-05-07T20:25:34.7492804Z 2025-05-07T20:25:34.7786484Z libcusolver-11.7.1.2 | 95.8 MB | | 0%  2025-05-07T20:25:34.7786790Z 2025-05-07T20:25:34.7799365Z libcublas-12.6.4.1 | 256.2 MB | ########3 | 83%  2025-05-07T20:25:34.7799631Z 2025-05-07T20:25:34.7799659Z 2025-05-07T20:25:34.7799665Z 2025-05-07T20:25:34.7799668Z 2025-05-07T20:25:34.7801551Z 2025-05-07T20:25:34.8045848Z cuda-nvvp-12.6.80 | 109.3 MB | 3 | 3%  2025-05-07T20:25:34.8496320Z nsight-compute-2024. 
| 443.1 MB | ####5 | 45% 2025-05-07T20:25:34.8496645Z 2025-05-07T20:25:34.8496652Z 2025-05-07T20:25:34.8496658Z 2025-05-07T20:25:34.8496663Z 2025-05-07T20:25:34.8496668Z 2025-05-07T20:25:34.8496673Z 2025-05-07T20:25:34.8804689Z libcusolver-11.7.1.2 | 95.8 MB | 2 | 3%  2025-05-07T20:25:34.8805094Z 2025-05-07T20:25:34.8805102Z 2025-05-07T20:25:34.8805107Z 2025-05-07T20:25:34.8805112Z 2025-05-07T20:25:34.8810935Z 2025-05-07T20:25:34.9365368Z cuda-nvvp-12.6.80 | 109.3 MB | 5 | 6%  2025-05-07T20:25:34.9370335Z 2025-05-07T20:25:34.9500545Z libcublas-12.6.4.1 | 256.2 MB | ########4 | 85%  2025-05-07T20:25:34.9500913Z 2025-05-07T20:25:34.9500918Z 2025-05-07T20:25:34.9501185Z 2025-05-07T20:25:34.9501189Z 2025-05-07T20:25:34.9501193Z 2025-05-07T20:25:34.9501198Z 2025-05-07T20:25:34.9682741Z libcusolver-11.7.1.2 | 95.8 MB | 5 | 6%  2025-05-07T20:25:34.9809453Z nsight-compute-2024. | 443.1 MB | ####6 | 46% 2025-05-07T20:25:34.9809838Z 2025-05-07T20:25:34.9809843Z 2025-05-07T20:25:34.9809846Z 2025-05-07T20:25:34.9809859Z 2025-05-07T20:25:34.9812171Z 2025-05-07T20:25:35.0508088Z cuda-nvvp-12.6.80 | 109.3 MB | 8 | 8%  2025-05-07T20:25:35.0508389Z 2025-05-07T20:25:35.0508393Z 2025-05-07T20:25:35.0508397Z 2025-05-07T20:25:35.0508409Z 2025-05-07T20:25:35.0508413Z 2025-05-07T20:25:35.0510233Z 2025-05-07T20:25:35.0810509Z libcusolver-11.7.1.2 | 95.8 MB | 8 | 8%  2025-05-07T20:25:35.0810826Z 2025-05-07T20:25:35.0810838Z 2025-05-07T20:25:35.0810842Z 2025-05-07T20:25:35.0810846Z 2025-05-07T20:25:35.0810850Z 2025-05-07T20:25:35.0872025Z cuda-nvvp-12.6.80 | 109.3 MB | # | 11%  2025-05-07T20:25:35.0873164Z 2025-05-07T20:25:35.1048215Z libcublas-12.6.4.1 | 256.2 MB | ########6 | 87%  2025-05-07T20:25:35.1511545Z nsight-compute-2024. | 443.1 MB | ####6 | 47% 2025-05-07T20:25:35.1511839Z 2025-05-07T20:25:35.1511843Z 2025-05-07T20:25:35.1511847Z 2025-05-07T20:25:35.1511851Z 2025-05-07T20:25:35.1511855Z 2025-05-07T20:25:35.1511858Z 2025-05-07T20:25:35.1811888Z libcusolver-11.7.1.2 | 95.8 MB | #1 | 11%  2025-05-07T20:25:35.1812182Z 2025-05-07T20:25:35.1812185Z 2025-05-07T20:25:35.1812189Z 2025-05-07T20:25:35.1812193Z 2025-05-07T20:25:35.1812197Z 2025-05-07T20:25:35.2274515Z cuda-nvvp-12.6.80 | 109.3 MB | #3 | 13%  2025-05-07T20:25:35.2276248Z 2025-05-07T20:25:35.2462351Z libcublas-12.6.4.1 | 256.2 MB | ########8 | 88%  2025-05-07T20:25:35.2514032Z nsight-compute-2024. | 443.1 MB | ####7 | 48% 2025-05-07T20:25:35.2514359Z 2025-05-07T20:25:35.2514363Z 2025-05-07T20:25:35.2514388Z 2025-05-07T20:25:35.2514393Z 2025-05-07T20:25:35.2514397Z 2025-05-07T20:25:35.2516875Z 2025-05-07T20:25:35.2814481Z libcusolver-11.7.1.2 | 95.8 MB | #3 | 14%  2025-05-07T20:25:35.2814848Z 2025-05-07T20:25:35.2814852Z 2025-05-07T20:25:35.2814856Z 2025-05-07T20:25:35.2814860Z 2025-05-07T20:25:35.2814864Z 2025-05-07T20:25:35.3434256Z cuda-nvvp-12.6.80 | 109.3 MB | #5 | 16%  2025-05-07T20:25:35.3434549Z 2025-05-07T20:25:35.3523835Z libcublas-12.6.4.1 | 256.2 MB | ########9 | 89%  2025-05-07T20:25:35.3524477Z 2025-05-07T20:25:35.3524485Z 2025-05-07T20:25:35.3524492Z 2025-05-07T20:25:35.3524499Z 2025-05-07T20:25:35.3524505Z 2025-05-07T20:25:35.3524513Z 2025-05-07T20:25:35.3542465Z libcusolver-11.7.1.2 | 95.8 MB | #6 | 17%  2025-05-07T20:25:35.3815437Z nsight-compute-2024. 
| 443.1 MB | ####8 | 48% 2025-05-07T20:25:35.3815700Z 2025-05-07T20:25:35.3815704Z 2025-05-07T20:25:35.3815732Z 2025-05-07T20:25:35.3815744Z 2025-05-07T20:25:35.3815748Z 2025-05-07T20:25:35.4526325Z cuda-nvvp-12.6.80 | 109.3 MB | #8 | 19%  2025-05-07T20:25:35.4526814Z 2025-05-07T20:25:35.4526844Z 2025-05-07T20:25:35.4526847Z 2025-05-07T20:25:35.4526851Z 2025-05-07T20:25:35.4526855Z 2025-05-07T20:25:35.4526863Z 2025-05-07T20:25:35.4580594Z libcusolver-11.7.1.2 | 95.8 MB | #9 | 20%  2025-05-07T20:25:35.4585978Z 2025-05-07T20:25:35.4610979Z libcublas-12.6.4.1 | 256.2 MB | ######### | 91%  2025-05-07T20:25:35.4822911Z nsight-compute-2024. | 443.1 MB | ####9 | 49% 2025-05-07T20:25:35.4823277Z 2025-05-07T20:25:35.4823513Z 2025-05-07T20:25:35.4823520Z 2025-05-07T20:25:35.4823523Z 2025-05-07T20:25:35.4823546Z 2025-05-07T20:25:35.5532133Z cuda-nvvp-12.6.80 | 109.3 MB | ##1 | 21%  2025-05-07T20:25:35.5532439Z 2025-05-07T20:25:35.5532443Z 2025-05-07T20:25:35.5532447Z 2025-05-07T20:25:35.5532451Z 2025-05-07T20:25:35.5532763Z 2025-05-07T20:25:35.5532767Z 2025-05-07T20:25:35.5615062Z libcusolver-11.7.1.2 | 95.8 MB | ##3 | 23%  2025-05-07T20:25:35.5616156Z 2025-05-07T20:25:35.5618814Z libcublas-12.6.4.1 | 256.2 MB | #########1 | 92%  2025-05-07T20:25:35.5829341Z nsight-compute-2024. | 443.1 MB | ####9 | 50% 2025-05-07T20:25:35.5829650Z 2025-05-07T20:25:35.5829654Z 2025-05-07T20:25:35.5829658Z 2025-05-07T20:25:35.5829662Z 2025-05-07T20:25:35.5829665Z 2025-05-07T20:25:35.6569136Z cuda-nvvp-12.6.80 | 109.3 MB | ##3 | 24%  2025-05-07T20:25:35.6569539Z 2025-05-07T20:25:35.6569542Z 2025-05-07T20:25:35.6569546Z 2025-05-07T20:25:35.6569550Z 2025-05-07T20:25:35.6569554Z 2025-05-07T20:25:35.6571769Z 2025-05-07T20:25:35.6623690Z libcusolver-11.7.1.2 | 95.8 MB | ##6 | 26%  2025-05-07T20:25:35.6701082Z nsight-compute-2024. | 443.1 MB | ##### | 51% 2025-05-07T20:25:35.6703260Z 2025-05-07T20:25:35.7624355Z libcublas-12.6.4.1 | 256.2 MB | #########3 | 93%  2025-05-07T20:25:35.7702711Z nsight-compute-2024. | 443.1 MB | #####1 | 51% 2025-05-07T20:25:35.7705091Z 2025-05-07T20:25:35.7716917Z libcublas-12.6.4.1 | 256.2 MB | #########4 | 95%  2025-05-07T20:25:35.7717181Z 2025-05-07T20:25:35.7717186Z 2025-05-07T20:25:35.7717189Z 2025-05-07T20:25:35.7717193Z 2025-05-07T20:25:35.7717197Z 2025-05-07T20:25:35.7718910Z 2025-05-07T20:25:35.8436519Z libcusolver-11.7.1.2 | 95.8 MB | ##9 | 29%  2025-05-07T20:25:35.8436825Z 2025-05-07T20:25:35.8436830Z 2025-05-07T20:25:35.8436833Z 2025-05-07T20:25:35.8436837Z 2025-05-07T20:25:35.8437991Z 2025-05-07T20:25:35.8630872Z cuda-nvvp-12.6.80 | 109.3 MB | ##6 | 26%  2025-05-07T20:25:35.8709922Z nsight-compute-2024. | 443.1 MB | #####2 | 52% 2025-05-07T20:25:35.8710455Z 2025-05-07T20:25:35.8800341Z libcublas-12.6.4.1 | 256.2 MB | #########5 | 96%  2025-05-07T20:25:35.8800687Z 2025-05-07T20:25:35.8800720Z 2025-05-07T20:25:35.8800725Z 2025-05-07T20:25:35.8800731Z 2025-05-07T20:25:35.8800736Z 2025-05-07T20:25:35.8804832Z 2025-05-07T20:25:35.9444760Z libcusolver-11.7.1.2 | 95.8 MB | ###2 | 32%  2025-05-07T20:25:35.9445063Z 2025-05-07T20:25:35.9445071Z 2025-05-07T20:25:35.9445075Z 2025-05-07T20:25:35.9445079Z 2025-05-07T20:25:35.9448566Z 2025-05-07T20:25:35.9784586Z cuda-nvvp-12.6.80 | 109.3 MB | ##9 | 29%  2025-05-07T20:25:35.9803787Z nsight-compute-2024. 
| 443.1 MB | #####3 | 53% 2025-05-07T20:25:35.9804162Z 2025-05-07T20:25:35.9804169Z 2025-05-07T20:25:35.9804176Z 2025-05-07T20:25:35.9804182Z 2025-05-07T20:25:35.9804187Z 2025-05-07T20:25:35.9805801Z 2025-05-07T20:25:35.9922672Z libcusolver-11.7.1.2 | 95.8 MB | ###4 | 35%  2025-05-07T20:25:35.9924144Z 2025-05-07T20:25:36.0446004Z libcublas-12.6.4.1 | 256.2 MB | #########7 | 97%  2025-05-07T20:25:36.0446277Z 2025-05-07T20:25:36.0446305Z 2025-05-07T20:25:36.0446309Z 2025-05-07T20:25:36.0446313Z 2025-05-07T20:25:36.0447850Z 2025-05-07T20:25:36.0805136Z cuda-nvvp-12.6.80 | 109.3 MB | ###1 | 32%  2025-05-07T20:25:36.0805567Z 2025-05-07T20:25:36.0805592Z 2025-05-07T20:25:36.0805597Z 2025-05-07T20:25:36.0805795Z 2025-05-07T20:25:36.0805801Z 2025-05-07T20:25:36.0806752Z 2025-05-07T20:25:36.0854592Z libcusolver-11.7.1.2 | 95.8 MB | ###8 | 38%  2025-05-07T20:25:36.0973030Z nsight-compute-2024. | 443.1 MB | #####3 | 54% 2025-05-07T20:25:36.0975920Z 2025-05-07T20:25:36.1447274Z libcublas-12.6.4.1 | 256.2 MB | #########8 | 98%  2025-05-07T20:25:36.1447647Z 2025-05-07T20:25:36.1447657Z 2025-05-07T20:25:36.1447661Z 2025-05-07T20:25:36.1447665Z 2025-05-07T20:25:36.1447668Z 2025-05-07T20:25:36.1807022Z cuda-nvvp-12.6.80 | 109.3 MB | ###4 | 35%  2025-05-07T20:25:36.1807318Z 2025-05-07T20:25:36.1807322Z 2025-05-07T20:25:36.1807325Z 2025-05-07T20:25:36.1807677Z 2025-05-07T20:25:36.1807681Z 2025-05-07T20:25:36.1808406Z 2025-05-07T20:25:36.1875876Z libcusolver-11.7.1.2 | 95.8 MB | ####1 | 41%  2025-05-07T20:25:36.1984431Z nsight-compute-2024. | 443.1 MB | #####4 | 55% 2025-05-07T20:25:36.1991717Z 2025-05-07T20:25:36.2454254Z libcublas-12.6.4.1 | 256.2 MB | #########9 | 100%  2025-05-07T20:25:36.2454602Z 2025-05-07T20:25:36.2454607Z 2025-05-07T20:25:36.2454612Z 2025-05-07T20:25:36.2454616Z 2025-05-07T20:25:36.2454631Z 2025-05-07T20:25:36.2809444Z cuda-nvvp-12.6.80 | 109.3 MB | ###7 | 38%  2025-05-07T20:25:36.2809730Z 2025-05-07T20:25:36.2809735Z 2025-05-07T20:25:36.2809738Z 2025-05-07T20:25:36.2809742Z 2025-05-07T20:25:36.2809754Z 2025-05-07T20:25:36.2809757Z 2025-05-07T20:25:36.2878313Z libcusolver-11.7.1.2 | 95.8 MB | ####4 | 45%  2025-05-07T20:25:36.3457416Z nsight-compute-2024. | 443.1 MB | #####5 | 55% 2025-05-07T20:25:36.3457682Z 2025-05-07T20:25:36.3457898Z 2025-05-07T20:25:36.3457903Z 2025-05-07T20:25:36.3457908Z 2025-05-07T20:25:36.3460810Z 2025-05-07T20:25:36.3813102Z cuda-nvvp-12.6.80 | 109.3 MB | #### | 41%  2025-05-07T20:25:36.3813452Z 2025-05-07T20:25:36.3813456Z 2025-05-07T20:25:36.3813460Z 2025-05-07T20:25:36.3813464Z 2025-05-07T20:25:36.3813467Z 2025-05-07T20:25:36.3813471Z 2025-05-07T20:25:36.4009933Z libcusolver-11.7.1.2 | 95.8 MB | ####8 | 48%  2025-05-07T20:25:36.4502491Z nsight-compute-2024. | 443.1 MB | #####6 | 56% 2025-05-07T20:25:36.4502761Z 2025-05-07T20:25:36.4502765Z 2025-05-07T20:25:36.4502769Z 2025-05-07T20:25:36.4502772Z 2025-05-07T20:25:36.4505139Z 2025-05-07T20:25:36.4815513Z cuda-nvvp-12.6.80 | 109.3 MB | ####3 | 44%  2025-05-07T20:25:36.4815895Z 2025-05-07T20:25:36.4815906Z 2025-05-07T20:25:36.4815911Z 2025-05-07T20:25:36.4815915Z 2025-05-07T20:25:36.4815920Z 2025-05-07T20:25:36.4815924Z 2025-05-07T20:25:36.5013323Z libcusolver-11.7.1.2 | 95.8 MB | #####1 | 51%  2025-05-07T20:25:36.5505580Z nsight-compute-2024. 
| 443.1 MB | #####6 | 57% 2025-05-07T20:25:36.5514816Z 2025-05-07T20:25:36.5514827Z 2025-05-07T20:25:36.5514856Z 2025-05-07T20:25:36.5514866Z 2025-05-07T20:25:36.5514874Z 2025-05-07T20:25:36.5867581Z cuda-nvvp-12.6.80 | 109.3 MB | ####6 | 47%  2025-05-07T20:25:36.5867933Z 2025-05-07T20:25:36.5867937Z 2025-05-07T20:25:36.5867941Z 2025-05-07T20:25:36.5867944Z 2025-05-07T20:25:36.5867948Z 2025-05-07T20:25:36.5871615Z 2025-05-07T20:25:36.6155728Z libcusolver-11.7.1.2 | 95.8 MB | #####4 | 55%  2025-05-07T20:25:36.6660129Z nsight-compute-2024. | 443.1 MB | #####7 | 58% 2025-05-07T20:25:36.6660416Z 2025-05-07T20:25:36.6660420Z 2025-05-07T20:25:36.6660424Z 2025-05-07T20:25:36.6660428Z 2025-05-07T20:25:36.6667479Z 2025-05-07T20:25:36.6868533Z cuda-nvvp-12.6.80 | 109.3 MB | ####9 | 50%  2025-05-07T20:25:36.6868831Z 2025-05-07T20:25:36.6868835Z 2025-05-07T20:25:36.6868839Z 2025-05-07T20:25:36.6868843Z 2025-05-07T20:25:36.6868846Z 2025-05-07T20:25:36.6868850Z 2025-05-07T20:25:36.7272926Z libcusolver-11.7.1.2 | 95.8 MB | #####7 | 58%  2025-05-07T20:25:36.7768310Z nsight-compute-2024. | 443.1 MB | #####8 | 58% 2025-05-07T20:25:36.7768602Z 2025-05-07T20:25:36.7768606Z 2025-05-07T20:25:36.7768610Z 2025-05-07T20:25:36.7768613Z 2025-05-07T20:25:36.7771159Z 2025-05-07T20:25:36.7900276Z cuda-nvvp-12.6.80 | 109.3 MB | #####2 | 52%  2025-05-07T20:25:36.7900665Z 2025-05-07T20:25:36.7900669Z 2025-05-07T20:25:36.7900673Z 2025-05-07T20:25:36.7900686Z 2025-05-07T20:25:36.7900690Z 2025-05-07T20:25:36.7900693Z 2025-05-07T20:25:36.8322604Z libcusolver-11.7.1.2 | 95.8 MB | ######1 | 61%  2025-05-07T20:25:36.8770970Z nsight-compute-2024. | 443.1 MB | #####9 | 59% 2025-05-07T20:25:36.8771352Z 2025-05-07T20:25:36.8771657Z 2025-05-07T20:25:36.8771662Z 2025-05-07T20:25:36.8771668Z 2025-05-07T20:25:36.8773142Z 2025-05-07T20:25:36.8972806Z cuda-nvvp-12.6.80 | 109.3 MB | #####5 | 55%  2025-05-07T20:25:36.8973375Z 2025-05-07T20:25:36.8973383Z 2025-05-07T20:25:36.8973388Z 2025-05-07T20:25:36.8973393Z 2025-05-07T20:25:36.8973398Z 2025-05-07T20:25:36.8975680Z 2025-05-07T20:25:36.9422925Z libcusolver-11.7.1.2 | 95.8 MB | ######4 | 64%  2025-05-07T20:25:36.9771430Z nsight-compute-2024. | 443.1 MB | #####9 | 60% 2025-05-07T20:25:36.9771696Z 2025-05-07T20:25:36.9771700Z 2025-05-07T20:25:36.9771704Z 2025-05-07T20:25:36.9771708Z 2025-05-07T20:25:36.9776076Z 2025-05-07T20:25:36.9974750Z cuda-nvvp-12.6.80 | 109.3 MB | #####8 | 58%  2025-05-07T20:25:36.9975042Z 2025-05-07T20:25:36.9975047Z 2025-05-07T20:25:36.9975050Z 2025-05-07T20:25:36.9975063Z 2025-05-07T20:25:36.9975066Z 2025-05-07T20:25:36.9976905Z 2025-05-07T20:25:37.0429443Z libcusolver-11.7.1.2 | 95.8 MB | ######7 | 68%  2025-05-07T20:25:37.0773863Z nsight-compute-2024. | 443.1 MB | ###### | 60% 2025-05-07T20:25:37.0774129Z 2025-05-07T20:25:37.0774133Z 2025-05-07T20:25:37.0774149Z 2025-05-07T20:25:37.0774153Z 2025-05-07T20:25:37.0774157Z 2025-05-07T20:25:37.0985298Z cuda-nvvp-12.6.80 | 109.3 MB | ###### | 61%  2025-05-07T20:25:37.0985582Z 2025-05-07T20:25:37.0985586Z 2025-05-07T20:25:37.0985590Z 2025-05-07T20:25:37.0985594Z 2025-05-07T20:25:37.0985597Z 2025-05-07T20:25:37.0987157Z 2025-05-07T20:25:37.1439638Z libcusolver-11.7.1.2 | 95.8 MB | ####### | 71%  2025-05-07T20:25:37.1774120Z nsight-compute-2024. 
| 443.1 MB | ######1 | 61% 2025-05-07T20:25:37.1774377Z 2025-05-07T20:25:37.1774515Z 2025-05-07T20:25:37.1774520Z 2025-05-07T20:25:37.1774524Z 2025-05-07T20:25:37.1775883Z 2025-05-07T20:25:37.1986279Z cuda-nvvp-12.6.80 | 109.3 MB | ######3 | 64%  2025-05-07T20:25:37.1986575Z 2025-05-07T20:25:37.1986583Z 2025-05-07T20:25:37.1986587Z 2025-05-07T20:25:37.1986591Z 2025-05-07T20:25:37.1986594Z 2025-05-07T20:25:37.1990440Z 2025-05-07T20:25:37.2775418Z libcusolver-11.7.1.2 | 95.8 MB | #######4 | 75%  2025-05-07T20:25:37.2775714Z 2025-05-07T20:25:37.2775718Z 2025-05-07T20:25:37.2775722Z 2025-05-07T20:25:37.2775725Z 2025-05-07T20:25:37.2779742Z 2025-05-07T20:25:37.2992875Z cuda-nvvp-12.6.80 | 109.3 MB | ######7 | 67%  2025-05-07T20:25:37.2993153Z 2025-05-07T20:25:37.2993157Z 2025-05-07T20:25:37.2993160Z 2025-05-07T20:25:37.2993164Z 2025-05-07T20:25:37.2993168Z 2025-05-07T20:25:37.2995720Z 2025-05-07T20:25:37.3425171Z libcusolver-11.7.1.2 | 95.8 MB | #######8 | 79%  2025-05-07T20:25:37.3782019Z nsight-compute-2024. | 443.1 MB | ######1 | 62% 2025-05-07T20:25:37.3782281Z 2025-05-07T20:25:37.3782286Z 2025-05-07T20:25:37.3782289Z 2025-05-07T20:25:37.3782293Z 2025-05-07T20:25:37.3784824Z 2025-05-07T20:25:37.3993330Z cuda-nvvp-12.6.80 | 109.3 MB | ####### | 70%  2025-05-07T20:25:37.3993618Z 2025-05-07T20:25:37.3993622Z 2025-05-07T20:25:37.3993625Z 2025-05-07T20:25:37.3993629Z 2025-05-07T20:25:37.3993644Z 2025-05-07T20:25:37.3993648Z 2025-05-07T20:25:37.4429644Z libcusolver-11.7.1.2 | 95.8 MB | ########2 | 83%  2025-05-07T20:25:37.4937332Z nsight-compute-2024. | 443.1 MB | ######2 | 63% 2025-05-07T20:25:37.4937599Z 2025-05-07T20:25:37.4937603Z 2025-05-07T20:25:37.4937607Z 2025-05-07T20:25:37.4937611Z 2025-05-07T20:25:37.4937614Z 2025-05-07T20:25:37.5219535Z cuda-nvvp-12.6.80 | 109.3 MB | #######3 | 73%  2025-05-07T20:25:37.5219871Z 2025-05-07T20:25:37.5219875Z 2025-05-07T20:25:37.5219879Z 2025-05-07T20:25:37.5219882Z 2025-05-07T20:25:37.5219886Z 2025-05-07T20:25:37.5219890Z 2025-05-07T20:25:37.5433545Z libcusolver-11.7.1.2 | 95.8 MB | ########6 | 86%  2025-05-07T20:25:37.5944655Z nsight-compute-2024. | 443.1 MB | ######3 | 63% 2025-05-07T20:25:37.5944910Z 2025-05-07T20:25:37.5944914Z 2025-05-07T20:25:37.5944918Z 2025-05-07T20:25:37.5944921Z 2025-05-07T20:25:37.5948849Z 2025-05-07T20:25:37.6294299Z cuda-nvvp-12.6.80 | 109.3 MB | #######6 | 76%  2025-05-07T20:25:37.6294678Z 2025-05-07T20:25:37.6294683Z 2025-05-07T20:25:37.6294687Z 2025-05-07T20:25:37.6294704Z 2025-05-07T20:25:37.6294708Z 2025-05-07T20:25:37.6299322Z 2025-05-07T20:25:37.6436819Z libcusolver-11.7.1.2 | 95.8 MB | ########9 | 90%  2025-05-07T20:25:37.7084163Z nsight-compute-2024. | 443.1 MB | ######4 | 64% 2025-05-07T20:25:37.7084438Z 2025-05-07T20:25:37.7084442Z 2025-05-07T20:25:37.7084446Z 2025-05-07T20:25:37.7084450Z 2025-05-07T20:25:37.7087338Z 2025-05-07T20:25:37.7295423Z cuda-nvvp-12.6.80 | 109.3 MB | #######9 | 79%  2025-05-07T20:25:37.7295705Z 2025-05-07T20:25:37.7295709Z 2025-05-07T20:25:37.7295712Z 2025-05-07T20:25:37.7295737Z 2025-05-07T20:25:37.7295748Z 2025-05-07T20:25:37.7299979Z 2025-05-07T20:25:37.7441763Z libcusolver-11.7.1.2 | 95.8 MB | #########3 | 93%  2025-05-07T20:25:37.8175018Z nsight-compute-2024. 
| 443.1 MB | ######4 | 65% 2025-05-07T20:25:37.8175279Z 2025-05-07T20:25:37.8175283Z 2025-05-07T20:25:37.8175286Z 2025-05-07T20:25:37.8175290Z 2025-05-07T20:25:37.8176580Z 2025-05-07T20:25:37.8307741Z cuda-nvvp-12.6.80 | 109.3 MB | ########2 | 82%  2025-05-07T20:25:37.8308067Z 2025-05-07T20:25:37.8308299Z 2025-05-07T20:25:37.8308307Z 2025-05-07T20:25:37.8308312Z 2025-05-07T20:25:37.8308317Z 2025-05-07T20:25:37.8308322Z 2025-05-07T20:25:37.8504806Z libcusolver-11.7.1.2 | 95.8 MB | #########6 | 97%  2025-05-07T20:25:37.9180185Z nsight-compute-2024. | 443.1 MB | ######5 | 66% 2025-05-07T20:25:37.9180449Z 2025-05-07T20:25:37.9180454Z 2025-05-07T20:25:37.9180458Z 2025-05-07T20:25:37.9180461Z 2025-05-07T20:25:37.9182070Z 2025-05-07T20:25:37.9506910Z cuda-nvvp-12.6.80 | 109.3 MB | ########4 | 85%  2025-05-07T20:25:38.0185965Z nsight-compute-2024. | 443.1 MB | ######6 | 66% 2025-05-07T20:25:38.0186227Z 2025-05-07T20:25:38.0186258Z 2025-05-07T20:25:38.0186264Z 2025-05-07T20:25:38.0186269Z 2025-05-07T20:25:38.0188156Z 2025-05-07T20:25:38.0511284Z cuda-nvvp-12.6.80 | 109.3 MB | ########7 | 88%  2025-05-07T20:25:38.0815126Z nsight-compute-2024. | 443.1 MB | ######7 | 67% 2025-05-07T20:25:38.0815384Z 2025-05-07T20:25:38.0815388Z 2025-05-07T20:25:38.0815392Z 2025-05-07T20:25:38.0815457Z 2025-05-07T20:25:38.1190454Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:38.1190737Z 2025-05-07T20:25:38.1190742Z 2025-05-07T20:25:38.1190748Z 2025-05-07T20:25:38.1190752Z 2025-05-07T20:25:38.1190922Z 2025-05-07T20:25:38.1512557Z cuda-nvvp-12.6.80 | 109.3 MB | #########1 | 91%  2025-05-07T20:25:38.2195625Z nsight-compute-2024. | 443.1 MB | ######8 | 68% 2025-05-07T20:25:38.2195948Z 2025-05-07T20:25:38.2195952Z 2025-05-07T20:25:38.2195957Z 2025-05-07T20:25:38.2195961Z 2025-05-07T20:25:38.2196091Z 2025-05-07T20:25:38.2512807Z cuda-nvvp-12.6.80 | 109.3 MB | #########5 | 95%  2025-05-07T20:25:38.3196369Z nsight-compute-2024. | 443.1 MB | ######9 | 69% 2025-05-07T20:25:38.3196625Z 2025-05-07T20:25:38.3196629Z 2025-05-07T20:25:38.3196633Z 2025-05-07T20:25:38.3196637Z 2025-05-07T20:25:38.3198241Z 2025-05-07T20:25:38.3516052Z cuda-nvvp-12.6.80 | 109.3 MB | #########8 | 99%  2025-05-07T20:25:38.4518126Z nsight-compute-2024. | 443.1 MB | ####### | 70% 2025-05-07T20:25:38.5518269Z nsight-compute-2024. | 443.1 MB | #######1 | 71% 2025-05-07T20:25:38.6519879Z nsight-compute-2024. | 443.1 MB | #######2 | 72% 2025-05-07T20:25:38.7521267Z nsight-compute-2024. | 443.1 MB | #######3 | 73% 2025-05-07T20:25:38.8516107Z nsight-compute-2024. | 443.1 MB | #######4 | 74% 2025-05-07T20:25:38.8516835Z 2025-05-07T20:25:38.8517663Z 2025-05-07T20:25:38.8596623Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%  2025-05-07T20:25:38.9105065Z nsight-compute-2024. | 443.1 MB | #######5 | 75% 2025-05-07T20:25:38.9105425Z 2025-05-07T20:25:38.9105431Z 2025-05-07T20:25:38.9105436Z 2025-05-07T20:25:38.9105441Z 2025-05-07T20:25:38.9105446Z 2025-05-07T20:25:38.9105451Z 2025-05-07T20:25:38.9118779Z 2025-05-07T20:25:38.9735561Z libnpp-12.3.1.54 | 93.4 MB | | 0%  2025-05-07T20:25:39.0109158Z nsight-compute-2024. | 443.1 MB | #######6 | 76% 2025-05-07T20:25:39.0109471Z 2025-05-07T20:25:39.0109475Z 2025-05-07T20:25:39.0109479Z 2025-05-07T20:25:39.0109483Z 2025-05-07T20:25:39.0109487Z 2025-05-07T20:25:39.0109492Z 2025-05-07T20:25:39.0109495Z 2025-05-07T20:25:39.0840635Z libnpp-12.3.1.54 | 93.4 MB | 3 | 4%  2025-05-07T20:25:39.1111907Z nsight-compute-2024. 
| 443.1 MB | #######7 | 77% 2025-05-07T20:25:39.1112214Z 2025-05-07T20:25:39.1112218Z 2025-05-07T20:25:39.1112222Z 2025-05-07T20:25:39.1112227Z 2025-05-07T20:25:39.1112232Z 2025-05-07T20:25:39.1112235Z 2025-05-07T20:25:39.1112239Z 2025-05-07T20:25:39.2008951Z libnpp-12.3.1.54 | 93.4 MB | 7 | 8%  2025-05-07T20:25:39.2114548Z nsight-compute-2024. | 443.1 MB | #######8 | 78% 2025-05-07T20:25:39.2114811Z 2025-05-07T20:25:39.2114815Z 2025-05-07T20:25:39.2114828Z 2025-05-07T20:25:39.2114832Z 2025-05-07T20:25:39.2114836Z 2025-05-07T20:25:39.2114840Z 2025-05-07T20:25:39.2118454Z 2025-05-07T20:25:39.3169933Z libnpp-12.3.1.54 | 93.4 MB | # | 11%  2025-05-07T20:25:39.3363881Z nsight-compute-2024. | 443.1 MB | #######9 | 79% 2025-05-07T20:25:39.3364254Z 2025-05-07T20:25:39.3364261Z 2025-05-07T20:25:39.3364266Z 2025-05-07T20:25:39.3364273Z 2025-05-07T20:25:39.3364278Z 2025-05-07T20:25:39.3364295Z 2025-05-07T20:25:39.3368660Z 2025-05-07T20:25:39.4177309Z libnpp-12.3.1.54 | 93.4 MB | #4 | 14%  2025-05-07T20:25:39.4366698Z nsight-compute-2024. | 443.1 MB | #######9 | 80% 2025-05-07T20:25:39.4367083Z 2025-05-07T20:25:39.4367108Z 2025-05-07T20:25:39.4367114Z 2025-05-07T20:25:39.4367119Z 2025-05-07T20:25:39.4367124Z 2025-05-07T20:25:39.4367129Z 2025-05-07T20:25:39.4369952Z 2025-05-07T20:25:39.5236273Z libnpp-12.3.1.54 | 93.4 MB | #7 | 18%  2025-05-07T20:25:39.5822728Z nsight-compute-2024. | 443.1 MB | ######## | 81% 2025-05-07T20:25:39.5823101Z 2025-05-07T20:25:39.5823107Z 2025-05-07T20:25:39.5823113Z 2025-05-07T20:25:39.5823118Z 2025-05-07T20:25:39.5823123Z 2025-05-07T20:25:39.5823129Z 2025-05-07T20:25:39.5823135Z 2025-05-07T20:25:39.6242302Z libnpp-12.3.1.54 | 93.4 MB | ## | 21%  2025-05-07T20:25:39.7120823Z nsight-compute-2024. | 443.1 MB | ########1 | 82% 2025-05-07T20:25:39.7121195Z 2025-05-07T20:25:39.7121233Z 2025-05-07T20:25:39.7121238Z 2025-05-07T20:25:39.7121244Z 2025-05-07T20:25:39.7121249Z 2025-05-07T20:25:39.7121254Z 2025-05-07T20:25:39.7121259Z 2025-05-07T20:25:39.7342456Z libnpp-12.3.1.54 | 93.4 MB | ##3 | 24%  2025-05-07T20:25:39.8122614Z nsight-compute-2024. | 443.1 MB | ########2 | 82% 2025-05-07T20:25:39.8122992Z 2025-05-07T20:25:39.8122998Z 2025-05-07T20:25:39.8123003Z 2025-05-07T20:25:39.8123008Z 2025-05-07T20:25:39.8123013Z 2025-05-07T20:25:39.8123018Z 2025-05-07T20:25:39.8123023Z 2025-05-07T20:25:39.8698420Z libnpp-12.3.1.54 | 93.4 MB | ##6 | 27%  2025-05-07T20:25:39.9123808Z nsight-compute-2024. | 443.1 MB | ########3 | 83% 2025-05-07T20:25:39.9124176Z 2025-05-07T20:25:39.9124182Z 2025-05-07T20:25:39.9124187Z 2025-05-07T20:25:39.9124192Z 2025-05-07T20:25:39.9124196Z 2025-05-07T20:25:39.9124201Z 2025-05-07T20:25:39.9124206Z 2025-05-07T20:25:39.9713538Z libnpp-12.3.1.54 | 93.4 MB | ### | 30%  2025-05-07T20:25:40.0125597Z nsight-compute-2024. | 443.1 MB | ########4 | 84% 2025-05-07T20:25:40.0126350Z 2025-05-07T20:25:40.0126369Z 2025-05-07T20:25:40.0126375Z 2025-05-07T20:25:40.0126381Z 2025-05-07T20:25:40.0126657Z 2025-05-07T20:25:40.0126662Z 2025-05-07T20:25:40.0126907Z 2025-05-07T20:25:40.0782758Z libnpp-12.3.1.54 | 93.4 MB | ###3 | 33%  2025-05-07T20:25:40.1131011Z nsight-compute-2024. | 443.1 MB | ########4 | 85% 2025-05-07T20:25:40.1131374Z 2025-05-07T20:25:40.1131379Z 2025-05-07T20:25:40.1131385Z 2025-05-07T20:25:40.1131402Z 2025-05-07T20:25:40.1131407Z 2025-05-07T20:25:40.1131413Z 2025-05-07T20:25:40.1133112Z 2025-05-07T20:25:40.1787383Z libnpp-12.3.1.54 | 93.4 MB | ###6 | 37%  2025-05-07T20:25:40.2139047Z nsight-compute-2024. 
| 443.1 MB | ########5 | 86% 2025-05-07T20:25:40.2139413Z 2025-05-07T20:25:40.2139419Z 2025-05-07T20:25:40.2139424Z 2025-05-07T20:25:40.2139430Z 2025-05-07T20:25:40.2139462Z 2025-05-07T20:25:40.2139467Z 2025-05-07T20:25:40.2139481Z 2025-05-07T20:25:40.2820297Z libnpp-12.3.1.54 | 93.4 MB | #### | 41%  2025-05-07T20:25:40.3148235Z nsight-compute-2024. | 443.1 MB | ########6 | 86% 2025-05-07T20:25:40.3148597Z 2025-05-07T20:25:40.3148604Z 2025-05-07T20:25:40.3148609Z 2025-05-07T20:25:40.3148614Z 2025-05-07T20:25:40.3148619Z 2025-05-07T20:25:40.3148624Z 2025-05-07T20:25:40.3152730Z 2025-05-07T20:25:40.3912756Z libnpp-12.3.1.54 | 93.4 MB | ####3 | 44%  2025-05-07T20:25:40.4224653Z nsight-compute-2024. | 443.1 MB | ########7 | 87% 2025-05-07T20:25:40.4225020Z 2025-05-07T20:25:40.4225027Z 2025-05-07T20:25:40.4225042Z 2025-05-07T20:25:40.4225047Z 2025-05-07T20:25:40.4225052Z 2025-05-07T20:25:40.4225057Z 2025-05-07T20:25:40.4227642Z 2025-05-07T20:25:40.4917861Z libnpp-12.3.1.54 | 93.4 MB | ####7 | 47%  2025-05-07T20:25:40.5226499Z nsight-compute-2024. | 443.1 MB | ########7 | 88% 2025-05-07T20:25:40.5226902Z 2025-05-07T20:25:40.5226908Z 2025-05-07T20:25:40.5226913Z 2025-05-07T20:25:40.5226918Z 2025-05-07T20:25:40.5226923Z 2025-05-07T20:25:40.5226937Z 2025-05-07T20:25:40.5230162Z 2025-05-07T20:25:40.5950921Z libnpp-12.3.1.54 | 93.4 MB | ##### | 51%  2025-05-07T20:25:40.6119379Z nsight-compute-2024. | 443.1 MB | ########8 | 89% 2025-05-07T20:25:40.6119646Z 2025-05-07T20:25:40.6119857Z 2025-05-07T20:25:40.6119862Z 2025-05-07T20:25:40.6119875Z 2025-05-07T20:25:40.6119891Z 2025-05-07T20:25:40.6133109Z 2025-05-07T20:25:40.6227706Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%  2025-05-07T20:25:40.6228159Z 2025-05-07T20:25:40.6228163Z 2025-05-07T20:25:40.6228168Z 2025-05-07T20:25:40.6228172Z 2025-05-07T20:25:40.6228176Z 2025-05-07T20:25:40.6228181Z 2025-05-07T20:25:40.6228185Z 2025-05-07T20:25:40.6570375Z libnpp-12.3.1.54 | 93.4 MB | #####4 | 54%  2025-05-07T20:25:40.6570717Z 2025-05-07T20:25:40.6570721Z 2025-05-07T20:25:40.6570724Z 2025-05-07T20:25:40.6570728Z 2025-05-07T20:25:40.6570732Z 2025-05-07T20:25:40.6570736Z 2025-05-07T20:25:40.6570739Z 2025-05-07T20:25:40.6572614Z 2025-05-07T20:25:40.6968029Z cuda-nvdisasm-12.6.7 | 47.6 MB | | 0%  2025-05-07T20:25:40.7403839Z nsight-compute-2024. | 443.1 MB | ########9 | 89% 2025-05-07T20:25:40.7404150Z 2025-05-07T20:25:40.7404156Z 2025-05-07T20:25:40.7404161Z 2025-05-07T20:25:40.7404166Z 2025-05-07T20:25:40.7404172Z 2025-05-07T20:25:40.7404177Z 2025-05-07T20:25:40.7406239Z 2025-05-07T20:25:40.7571750Z libnpp-12.3.1.54 | 93.4 MB | #####7 | 57%  2025-05-07T20:25:40.7572187Z 2025-05-07T20:25:40.7572194Z 2025-05-07T20:25:40.7572199Z 2025-05-07T20:25:40.7572205Z 2025-05-07T20:25:40.7572210Z 2025-05-07T20:25:40.7572216Z 2025-05-07T20:25:40.7572221Z 2025-05-07T20:25:40.7577762Z 2025-05-07T20:25:40.8099744Z cuda-nvdisasm-12.6.7 | 47.6 MB | 6 | 6%  2025-05-07T20:25:40.8574612Z nsight-compute-2024. | 443.1 MB | ######### | 90% 2025-05-07T20:25:40.8574895Z 2025-05-07T20:25:40.8574899Z 2025-05-07T20:25:40.8575146Z 2025-05-07T20:25:40.8575151Z 2025-05-07T20:25:40.8575155Z 2025-05-07T20:25:40.8575158Z 2025-05-07T20:25:40.8575162Z 2025-05-07T20:25:40.8575811Z 2025-05-07T20:25:40.9099489Z cuda-nvdisasm-12.6.7 | 47.6 MB | #2 | 13%  2025-05-07T20:25:40.9470092Z nsight-compute-2024. 
| 443.1 MB | ######### | 91% 2025-05-07T20:25:40.9470429Z 2025-05-07T20:25:40.9470644Z 2025-05-07T20:25:40.9470657Z 2025-05-07T20:25:40.9470661Z 2025-05-07T20:25:40.9470664Z 2025-05-07T20:25:40.9470668Z 2025-05-07T20:25:40.9472555Z 2025-05-07T20:25:40.9578550Z libnpp-12.3.1.54 | 93.4 MB | ###### | 61%  2025-05-07T20:25:40.9578844Z 2025-05-07T20:25:40.9578848Z 2025-05-07T20:25:40.9578851Z 2025-05-07T20:25:40.9578883Z 2025-05-07T20:25:40.9578887Z 2025-05-07T20:25:40.9578892Z 2025-05-07T20:25:40.9578896Z 2025-05-07T20:25:40.9580490Z 2025-05-07T20:25:41.0105218Z cuda-nvdisasm-12.6.7 | 47.6 MB | ## | 20%  2025-05-07T20:25:41.0471091Z nsight-compute-2024. | 443.1 MB | #########1 | 92% 2025-05-07T20:25:41.0471369Z 2025-05-07T20:25:41.0471373Z 2025-05-07T20:25:41.0471377Z 2025-05-07T20:25:41.0471380Z 2025-05-07T20:25:41.0471384Z 2025-05-07T20:25:41.0471387Z 2025-05-07T20:25:41.0472828Z 2025-05-07T20:25:41.0703547Z libnpp-12.3.1.54 | 93.4 MB | ######3 | 63%  2025-05-07T20:25:41.0703830Z 2025-05-07T20:25:41.0703834Z 2025-05-07T20:25:41.0703837Z 2025-05-07T20:25:41.0703841Z 2025-05-07T20:25:41.0703845Z 2025-05-07T20:25:41.0703849Z 2025-05-07T20:25:41.0703853Z 2025-05-07T20:25:41.0704959Z 2025-05-07T20:25:41.1362127Z cuda-nvdisasm-12.6.7 | 47.6 MB | ##7 | 27%  2025-05-07T20:25:41.1561005Z nsight-compute-2024. | 443.1 MB | #########2 | 92% 2025-05-07T20:25:41.1561416Z 2025-05-07T20:25:41.1561422Z 2025-05-07T20:25:41.1561427Z 2025-05-07T20:25:41.1561432Z 2025-05-07T20:25:41.1561438Z 2025-05-07T20:25:41.1561442Z 2025-05-07T20:25:41.1561460Z 2025-05-07T20:25:41.1918531Z libnpp-12.3.1.54 | 93.4 MB | ######6 | 66%  2025-05-07T20:25:41.1918920Z 2025-05-07T20:25:41.1918925Z 2025-05-07T20:25:41.1918931Z 2025-05-07T20:25:41.1918936Z 2025-05-07T20:25:41.1918941Z 2025-05-07T20:25:41.1918946Z 2025-05-07T20:25:41.1918952Z 2025-05-07T20:25:41.1918957Z 2025-05-07T20:25:41.2363671Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###4 | 34%  2025-05-07T20:25:41.2562695Z nsight-compute-2024. | 443.1 MB | #########2 | 93% 2025-05-07T20:25:41.2563066Z 2025-05-07T20:25:41.2563073Z 2025-05-07T20:25:41.2563078Z 2025-05-07T20:25:41.2563083Z 2025-05-07T20:25:41.2563088Z 2025-05-07T20:25:41.2563093Z 2025-05-07T20:25:41.2564673Z 2025-05-07T20:25:41.2940101Z libnpp-12.3.1.54 | 93.4 MB | ######8 | 69%  2025-05-07T20:25:41.2940519Z 2025-05-07T20:25:41.2940525Z 2025-05-07T20:25:41.2940531Z 2025-05-07T20:25:41.2940536Z 2025-05-07T20:25:41.2940541Z 2025-05-07T20:25:41.2940559Z 2025-05-07T20:25:41.2940565Z 2025-05-07T20:25:41.2940570Z 2025-05-07T20:25:41.3531873Z cuda-nvdisasm-12.6.7 | 47.6 MB | #### | 40%  2025-05-07T20:25:41.3565937Z nsight-compute-2024. 
| 443.1 MB | #########3 | 94% 2025-05-07T20:25:41.3566285Z 2025-05-07T20:25:41.3566292Z 2025-05-07T20:25:41.3566299Z 2025-05-07T20:25:41.3566310Z 2025-05-07T20:25:41.3566316Z 2025-05-07T20:25:41.3566322Z 2025-05-07T20:25:41.3568029Z 2025-05-07T20:25:41.3951024Z libnpp-12.3.1.54 | 93.4 MB | #######1 | 72%  2025-05-07T20:25:41.3951410Z 2025-05-07T20:25:41.3951414Z 2025-05-07T20:25:41.3951418Z 2025-05-07T20:25:41.3951422Z 2025-05-07T20:25:41.3951426Z 2025-05-07T20:25:41.3951429Z 2025-05-07T20:25:41.3951433Z 2025-05-07T20:25:41.3956250Z 2025-05-07T20:25:41.4568488Z cuda-nvdisasm-12.6.7 | 47.6 MB | ####6 | 47%  2025-05-07T20:25:41.4568880Z 2025-05-07T20:25:41.4568884Z 2025-05-07T20:25:41.4568888Z 2025-05-07T20:25:41.4569107Z 2025-05-07T20:25:41.4569112Z 2025-05-07T20:25:41.4569125Z 2025-05-07T20:25:41.4570967Z 2025-05-07T20:25:41.4585377Z libnpp-12.3.1.54 | 93.4 MB | #######4 | 75%  2025-05-07T20:25:41.4951558Z nsight-compute-2024. | 443.1 MB | #########4 | 94% 2025-05-07T20:25:41.4951921Z 2025-05-07T20:25:41.4951927Z 2025-05-07T20:25:41.4951932Z 2025-05-07T20:25:41.4951937Z 2025-05-07T20:25:41.4951942Z 2025-05-07T20:25:41.4951947Z 2025-05-07T20:25:41.4951952Z 2025-05-07T20:25:41.4951957Z 2025-05-07T20:25:41.5581495Z cuda-nvdisasm-12.6.7 | 47.6 MB | #####3 | 53%  2025-05-07T20:25:41.5581883Z 2025-05-07T20:25:41.5581887Z 2025-05-07T20:25:41.5581892Z 2025-05-07T20:25:41.5581896Z 2025-05-07T20:25:41.5581935Z 2025-05-07T20:25:41.5581939Z 2025-05-07T20:25:41.5583455Z 2025-05-07T20:25:41.5590985Z libnpp-12.3.1.54 | 93.4 MB | #######7 | 78%  2025-05-07T20:25:41.6097910Z nsight-compute-2024. | 443.1 MB | #########5 | 95% 2025-05-07T20:25:41.6098190Z 2025-05-07T20:25:41.6098195Z 2025-05-07T20:25:41.6098198Z 2025-05-07T20:25:41.6098202Z 2025-05-07T20:25:41.6098206Z 2025-05-07T20:25:41.6098209Z 2025-05-07T20:25:41.6098213Z 2025-05-07T20:25:41.6101044Z 2025-05-07T20:25:41.6582933Z cuda-nvdisasm-12.6.7 | 47.6 MB | #####9 | 60%  2025-05-07T20:25:41.6583343Z 2025-05-07T20:25:41.6583347Z 2025-05-07T20:25:41.6583360Z 2025-05-07T20:25:41.6583364Z 2025-05-07T20:25:41.6583367Z 2025-05-07T20:25:41.6583372Z 2025-05-07T20:25:41.6584183Z 2025-05-07T20:25:41.6654173Z libnpp-12.3.1.54 | 93.4 MB | ######## | 81%  2025-05-07T20:25:41.7098023Z nsight-compute-2024. | 443.1 MB | #########5 | 96% 2025-05-07T20:25:41.7098448Z 2025-05-07T20:25:41.7098454Z 2025-05-07T20:25:41.7098460Z 2025-05-07T20:25:41.7098465Z 2025-05-07T20:25:41.7098470Z 2025-05-07T20:25:41.7098486Z 2025-05-07T20:25:41.7098491Z 2025-05-07T20:25:41.7101106Z 2025-05-07T20:25:41.7617464Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######5 | 66%  2025-05-07T20:25:41.7617865Z 2025-05-07T20:25:41.7617880Z 2025-05-07T20:25:41.7617884Z 2025-05-07T20:25:41.7617888Z 2025-05-07T20:25:41.7617892Z 2025-05-07T20:25:41.7617895Z 2025-05-07T20:25:41.7621068Z 2025-05-07T20:25:41.7758230Z libnpp-12.3.1.54 | 93.4 MB | ########3 | 84%  2025-05-07T20:25:41.8101228Z nsight-compute-2024. 
| 443.1 MB | #########6 | 96% 2025-05-07T20:25:41.8101584Z 2025-05-07T20:25:41.8101591Z 2025-05-07T20:25:41.8101596Z 2025-05-07T20:25:41.8101601Z 2025-05-07T20:25:41.8101615Z 2025-05-07T20:25:41.8101621Z 2025-05-07T20:25:41.8101626Z 2025-05-07T20:25:41.8101631Z 2025-05-07T20:25:41.8200489Z cuda-nvdisasm-12.6.7 | 47.6 MB | #######2 | 72%  2025-05-07T20:25:41.8200926Z 2025-05-07T20:25:41.8200941Z 2025-05-07T20:25:41.8200946Z 2025-05-07T20:25:41.8200952Z 2025-05-07T20:25:41.8201716Z 2025-05-07T20:25:41.8623520Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%  2025-05-07T20:25:41.8623833Z 2025-05-07T20:25:41.8623838Z 2025-05-07T20:25:41.8623841Z 2025-05-07T20:25:41.8623845Z 2025-05-07T20:25:41.8623849Z 2025-05-07T20:25:41.8623853Z 2025-05-07T20:25:41.8633137Z 2025-05-07T20:25:41.8761348Z libnpp-12.3.1.54 | 93.4 MB | ########7 | 87%  2025-05-07T20:25:41.8883950Z nsight-compute-2024. | 443.1 MB | #########7 | 97% 2025-05-07T20:25:41.8884360Z 2025-05-07T20:25:41.8884374Z 2025-05-07T20:25:41.8884380Z 2025-05-07T20:25:41.8884385Z 2025-05-07T20:25:41.8884390Z 2025-05-07T20:25:41.8884396Z 2025-05-07T20:25:41.8884401Z 2025-05-07T20:25:41.8884406Z 2025-05-07T20:25:41.8889073Z 2025-05-07T20:25:41.9105532Z libcurand-10.3.7.77 | 39.9 MB | | 0%  2025-05-07T20:25:41.9106482Z 2025-05-07T20:25:41.9106487Z 2025-05-07T20:25:41.9106491Z 2025-05-07T20:25:41.9106495Z 2025-05-07T20:25:41.9106498Z 2025-05-07T20:25:41.9106502Z 2025-05-07T20:25:41.9106683Z 2025-05-07T20:25:41.9107386Z 2025-05-07T20:25:41.9727323Z cuda-nvdisasm-12.6.7 | 47.6 MB | #######8 | 79%  2025-05-07T20:25:41.9727716Z 2025-05-07T20:25:41.9727721Z 2025-05-07T20:25:41.9727724Z 2025-05-07T20:25:41.9727728Z 2025-05-07T20:25:41.9727732Z 2025-05-07T20:25:41.9727736Z 2025-05-07T20:25:41.9736070Z 2025-05-07T20:25:41.9868785Z libnpp-12.3.1.54 | 93.4 MB | ######### | 90%  2025-05-07T20:25:41.9887052Z nsight-compute-2024. | 443.1 MB | #########7 | 98% 2025-05-07T20:25:41.9887441Z 2025-05-07T20:25:41.9887447Z 2025-05-07T20:25:41.9887453Z 2025-05-07T20:25:41.9887459Z 2025-05-07T20:25:41.9887464Z 2025-05-07T20:25:41.9887470Z 2025-05-07T20:25:41.9887477Z 2025-05-07T20:25:41.9887483Z 2025-05-07T20:25:41.9889422Z 2025-05-07T20:25:42.0208523Z libcurand-10.3.7.77 | 39.9 MB | 5 | 6%  2025-05-07T20:25:42.0208834Z 2025-05-07T20:25:42.0208839Z 2025-05-07T20:25:42.0208843Z 2025-05-07T20:25:42.0208864Z 2025-05-07T20:25:42.0208867Z 2025-05-07T20:25:42.0208871Z 2025-05-07T20:25:42.0208875Z 2025-05-07T20:25:42.0208878Z 2025-05-07T20:25:42.0847661Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########5 | 85%  2025-05-07T20:25:42.0848097Z 2025-05-07T20:25:42.0848103Z 2025-05-07T20:25:42.0848108Z 2025-05-07T20:25:42.0848113Z 2025-05-07T20:25:42.0848118Z 2025-05-07T20:25:42.0848123Z 2025-05-07T20:25:42.0849726Z 2025-05-07T20:25:42.0977434Z libnpp-12.3.1.54 | 93.4 MB | #########3 | 93%  2025-05-07T20:25:42.0977800Z 2025-05-07T20:25:42.0977806Z 2025-05-07T20:25:42.0977810Z 2025-05-07T20:25:42.0977823Z 2025-05-07T20:25:42.0977826Z 2025-05-07T20:25:42.0977831Z 2025-05-07T20:25:42.0977835Z 2025-05-07T20:25:42.0977873Z 2025-05-07T20:25:42.0982667Z 2025-05-07T20:25:42.0986131Z libcurand-10.3.7.77 | 39.9 MB | #1 | 12%  2025-05-07T20:25:42.1326382Z nsight-compute-2024. 
| 443.1 MB | #########8 | 98% 2025-05-07T20:25:42.1326788Z 2025-05-07T20:25:42.1326795Z 2025-05-07T20:25:42.1326800Z 2025-05-07T20:25:42.1326816Z 2025-05-07T20:25:42.1326822Z 2025-05-07T20:25:42.1326827Z 2025-05-07T20:25:42.1326832Z 2025-05-07T20:25:42.1330627Z 2025-05-07T20:25:42.1849893Z cuda-nvdisasm-12.6.7 | 47.6 MB | #########1 | 91%  2025-05-07T20:25:42.1850216Z 2025-05-07T20:25:42.1850221Z 2025-05-07T20:25:42.1850224Z 2025-05-07T20:25:42.1850229Z 2025-05-07T20:25:42.1850232Z 2025-05-07T20:25:42.1850236Z 2025-05-07T20:25:42.1851594Z 2025-05-07T20:25:42.1987896Z libnpp-12.3.1.54 | 93.4 MB | #########6 | 96%  2025-05-07T20:25:42.1988288Z 2025-05-07T20:25:42.1988292Z 2025-05-07T20:25:42.1988296Z 2025-05-07T20:25:42.1988309Z 2025-05-07T20:25:42.1988343Z 2025-05-07T20:25:42.1988346Z 2025-05-07T20:25:42.1988351Z 2025-05-07T20:25:42.1988355Z 2025-05-07T20:25:42.1990349Z 2025-05-07T20:25:42.2146042Z libcurand-10.3.7.77 | 39.9 MB | #7 | 18%  2025-05-07T20:25:42.2589163Z nsight-compute-2024. | 443.1 MB | #########8 | 99% 2025-05-07T20:25:42.2589440Z 2025-05-07T20:25:42.2589444Z 2025-05-07T20:25:42.2589448Z 2025-05-07T20:25:42.2589459Z 2025-05-07T20:25:42.2589463Z 2025-05-07T20:25:42.2589467Z 2025-05-07T20:25:42.2589470Z 2025-05-07T20:25:42.2591242Z 2025-05-07T20:25:42.2892681Z cuda-nvdisasm-12.6.7 | 47.6 MB | #########7 | 97%  2025-05-07T20:25:42.2893000Z 2025-05-07T20:25:42.2893004Z 2025-05-07T20:25:42.2893008Z 2025-05-07T20:25:42.2893013Z 2025-05-07T20:25:42.2893016Z 2025-05-07T20:25:42.2893021Z 2025-05-07T20:25:42.2895642Z 2025-05-07T20:25:42.2989621Z libnpp-12.3.1.54 | 93.4 MB | #########9 | 99%  2025-05-07T20:25:42.2990186Z 2025-05-07T20:25:42.2990191Z 2025-05-07T20:25:42.2990194Z 2025-05-07T20:25:42.2990198Z 2025-05-07T20:25:42.2990202Z 2025-05-07T20:25:42.2990206Z 2025-05-07T20:25:42.2990209Z 2025-05-07T20:25:42.2990213Z 2025-05-07T20:25:42.2992190Z 2025-05-07T20:25:42.3312845Z libcurand-10.3.7.77 | 39.9 MB | ##3 | 24%  2025-05-07T20:25:42.3991754Z nsight-compute-2024. 
2025-05-07T20:25:43.1016550Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%
2025-05-07T20:25:44.0078792Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########## | 100%
2025-05-07T20:25:44.4167744Z libcublas-12.6.4.1 | 256.2 MB | ########## | 100%
2025-05-07T20:25:44.4507753Z libcurand-10.3.7.77 | 39.9 MB | ########## | 100%
2025-05-07T20:25:45.6925757Z libnpp-12.3.1.54 | 93.4 MB | ########## | 100%
2025-05-07T20:25:46.0031453Z cuda-nvcc-tools-12.6 | 23.0 MB | ########## | 100%
2025-05-07T20:25:46.5144589Z python-3.11.8 | 29.3 MB | ########## | 100%
2025-05-07T20:25:46.6798841Z gds-tools-1.11.1.6 | 37.8 MB | ########## | 100%
2025-05-07T20:25:46.8932369Z cuda-nvrtc-12.6.85 | 17.3 MB | ########## | 100%
2025-05-07T20:25:47.1326530Z libnvjitlink-12.6.85 | 14.9 MB | ########## | 100%
2025-05-07T20:25:47.1382536Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%
2025-05-07T20:25:47.2384415Z cuda-nvcc-dev_linux- | 10.8 MB | ########## | 100%
2025-05-07T20:25:47.3389913Z ... (more hidden) ...
2025-05-07T20:25:47.5332153Z cuda-nvvm-tools-12.6 | 10.4 MB | ########## | 100%
2025-05-07T20:25:47.5403869Z cuda-sanitizer-api-1 | 8.9 MB | ########## | 100%
2025-05-07T20:25:47.8882455Z cuda-nvvm-impl-12.6. | 7.7 MB | ########## | 100%
2025-05-07T20:25:48.4604005Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%
2025-05-07T20:25:49.5941908Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%
2025-05-07T20:25:50.3071646Z nsight-compute-2024. | 443.1 MB | ########## | 100%
2025-05-07T20:25:58.2660162Z 2025-05-07T20:25:58.2660322Z  2025-05-07T20:25:58.2660496Z 2025-05-07T20:25:58.2660502Z 2025-05-07T20:25:58.2660508Z 2025-05-07T20:25:58.2660682Z  2025-05-07T20:25:58.2660860Z 2025-05-07T20:25:58.2660865Z 2025-05-07T20:25:58.2660871Z 2025-05-07T20:25:58.2660877Z 2025-05-07T20:25:58.2661055Z  2025-05-07T20:25:58.2661236Z 2025-05-07T20:25:58.2661241Z 2025-05-07T20:25:58.2661255Z 2025-05-07T20:25:58.2661270Z 2025-05-07T20:25:58.2661276Z 2025-05-07T20:25:58.2661463Z  2025-05-07T20:25:58.2661655Z 2025-05-07T20:25:58.2661660Z 2025-05-07T20:25:58.2661665Z 2025-05-07T20:25:58.2661676Z 2025-05-07T20:25:58.2661682Z 2025-05-07T20:25:58.2661687Z 2025-05-07T20:25:58.2661884Z  2025-05-07T20:25:58.2662086Z 2025-05-07T20:25:58.2662092Z 2025-05-07T20:25:58.2662097Z 2025-05-07T20:25:58.2662103Z 2025-05-07T20:25:58.2662108Z 2025-05-07T20:25:58.2662113Z 2025-05-07T20:25:58.2662118Z 2025-05-07T20:25:58.2662321Z  2025-05-07T20:25:58.2662533Z 2025-05-07T20:25:58.2662539Z 2025-05-07T20:25:58.2662544Z 2025-05-07T20:25:58.2662550Z 2025-05-07T20:25:58.2662555Z 2025-05-07T20:25:58.2662561Z 2025-05-07T20:25:58.2662567Z 2025-05-07T20:25:58.2662572Z 2025-05-07T20:25:58.2662785Z  2025-05-07T20:25:58.2663017Z 2025-05-07T20:25:58.2663023Z 2025-05-07T20:25:58.2663028Z 2025-05-07T20:25:58.2663033Z 2025-05-07T20:25:58.2663172Z 2025-05-07T20:25:58.2663177Z 2025-05-07T20:25:58.2663183Z 2025-05-07T20:25:58.2663188Z 2025-05-07T20:25:58.2663205Z 2025-05-07T20:25:58.2663413Z  2025-05-07T20:25:58.2663778Z 2025-05-07T20:25:58.2663786Z 2025-05-07T20:25:58.2663791Z 2025-05-07T20:25:58.2663797Z 2025-05-07T20:25:58.2663802Z 2025-05-07T20:25:58.2663807Z 2025-05-07T20:25:58.2663813Z 2025-05-07T20:25:58.2663829Z 2025-05-07T20:25:58.2663835Z 2025-05-07T20:25:58.2663841Z 2025-05-07T20:25:58.2664064Z  2025-05-07T20:25:58.2664315Z 2025-05-07T20:25:58.2664321Z 2025-05-07T20:25:58.2664327Z 2025-05-07T20:25:58.2664332Z 2025-05-07T20:25:58.2664350Z 2025-05-07T20:25:58.2664356Z 2025-05-07T20:25:58.2664362Z 2025-05-07T20:25:58.2664367Z 2025-05-07T20:25:58.2664373Z 2025-05-07T20:25:58.2664379Z 2025-05-07T20:25:58.2664385Z 2025-05-07T20:25:58.2664597Z  2025-05-07T20:25:58.2664890Z 2025-05-07T20:25:58.2664895Z 2025-05-07T20:25:58.2664910Z 2025-05-07T20:25:58.2664915Z 2025-05-07T20:25:58.2664921Z 2025-05-07T20:25:58.2664926Z 2025-05-07T20:25:58.2664931Z 2025-05-07T20:25:58.2664936Z 2025-05-07T20:25:58.2664942Z 2025-05-07T20:25:58.2664956Z 2025-05-07T20:25:58.2664963Z 2025-05-07T20:25:58.2664968Z 2025-05-07T20:25:58.2665193Z  2025-05-07T20:25:58.2665488Z 2025-05-07T20:25:58.2665494Z 2025-05-07T20:25:58.2665500Z 2025-05-07T20:25:58.2665506Z 2025-05-07T20:25:58.2665512Z 2025-05-07T20:25:58.2665517Z 2025-05-07T20:25:58.2665523Z 2025-05-07T20:25:58.2665529Z 2025-05-07T20:25:58.2665535Z 2025-05-07T20:25:58.2665540Z 2025-05-07T20:25:58.2665546Z 2025-05-07T20:25:58.2665552Z 2025-05-07T20:25:58.2665557Z 2025-05-07T20:25:58.2665782Z  2025-05-07T20:25:58.2666087Z 2025-05-07T20:25:58.2666094Z 2025-05-07T20:25:58.2666099Z 2025-05-07T20:25:58.2666104Z 2025-05-07T20:25:58.2666110Z 2025-05-07T20:25:58.2666116Z 2025-05-07T20:25:58.2666130Z 2025-05-07T20:25:58.2666135Z 2025-05-07T20:25:58.2666141Z 2025-05-07T20:25:58.2666147Z 2025-05-07T20:25:58.2666152Z 2025-05-07T20:25:58.2666158Z 2025-05-07T20:25:58.2666164Z 2025-05-07T20:25:58.2666169Z 2025-05-07T20:25:58.2666417Z  2025-05-07T20:25:58.2666730Z 2025-05-07T20:25:58.2666736Z 2025-05-07T20:25:58.2666741Z 2025-05-07T20:25:58.2666746Z 2025-05-07T20:25:58.2666752Z 2025-05-07T20:25:58.2666758Z 
2025-05-07T20:25:58.2666763Z 2025-05-07T20:25:58.2666778Z 2025-05-07T20:25:58.2666783Z 2025-05-07T20:25:58.2666789Z 2025-05-07T20:25:58.2666794Z 2025-05-07T20:25:58.2666800Z 2025-05-07T20:25:58.2666805Z 2025-05-07T20:25:58.2666811Z 2025-05-07T20:25:58.2666816Z 2025-05-07T20:25:58.2667066Z  2025-05-07T20:25:58.2667398Z 2025-05-07T20:25:58.2667404Z 2025-05-07T20:25:58.2667410Z 2025-05-07T20:25:58.2667416Z 2025-05-07T20:25:58.2667422Z 2025-05-07T20:25:58.2667427Z 2025-05-07T20:25:58.2667432Z 2025-05-07T20:25:58.2667447Z 2025-05-07T20:25:58.2667453Z 2025-05-07T20:25:58.2667458Z 2025-05-07T20:25:58.2667463Z 2025-05-07T20:25:58.2667468Z 2025-05-07T20:25:58.2667473Z 2025-05-07T20:25:58.2667477Z 2025-05-07T20:25:58.2667487Z 2025-05-07T20:25:58.2667494Z 2025-05-07T20:25:58.2667759Z  2025-05-07T20:25:58.2668086Z 2025-05-07T20:25:58.2668092Z 2025-05-07T20:25:58.2668097Z 2025-05-07T20:25:58.2668103Z 2025-05-07T20:25:58.2668108Z 2025-05-07T20:25:58.2668112Z 2025-05-07T20:25:58.2668117Z 2025-05-07T20:25:58.2668123Z 2025-05-07T20:25:58.2668128Z 2025-05-07T20:25:58.2668133Z 2025-05-07T20:25:58.2668138Z 2025-05-07T20:25:58.2668143Z 2025-05-07T20:25:58.2668148Z 2025-05-07T20:25:58.2668153Z 2025-05-07T20:25:58.2668169Z 2025-05-07T20:25:58.2668175Z 2025-05-07T20:25:58.2668180Z 2025-05-07T20:25:58.2668446Z  2025-05-07T20:25:58.2668785Z 2025-05-07T20:25:58.2668790Z 2025-05-07T20:25:58.2668795Z 2025-05-07T20:25:58.2668932Z 2025-05-07T20:25:58.2668938Z 2025-05-07T20:25:58.2669000Z 2025-05-07T20:25:58.2669006Z 2025-05-07T20:25:58.2669011Z 2025-05-07T20:25:58.2669017Z 2025-05-07T20:25:58.2669023Z 2025-05-07T20:25:58.2669140Z 2025-05-07T20:25:58.2669148Z 2025-05-07T20:25:58.2669153Z 2025-05-07T20:25:58.2669158Z 2025-05-07T20:25:58.2669163Z 2025-05-07T20:25:58.2669179Z 2025-05-07T20:25:58.2669185Z 2025-05-07T20:25:58.2669190Z 2025-05-07T20:25:58.2669486Z  2025-05-07T20:25:58.2669825Z 2025-05-07T20:25:58.2669830Z 2025-05-07T20:25:58.2670004Z  2025-05-07T20:25:58.2670170Z 2025-05-07T20:25:58.2670176Z 2025-05-07T20:25:58.2670339Z  2025-05-07T20:25:58.2670517Z 2025-05-07T20:25:58.2670523Z 2025-05-07T20:25:58.2670529Z 2025-05-07T20:25:58.2670711Z  2025-05-07T20:25:58.2670909Z 2025-05-07T20:25:58.2670916Z 2025-05-07T20:25:58.2670923Z 2025-05-07T20:25:58.2670929Z 2025-05-07T20:25:58.2671124Z  2025-05-07T20:25:58.2671303Z 2025-05-07T20:25:58.2671317Z 2025-05-07T20:25:58.2671323Z 2025-05-07T20:25:58.2671328Z 2025-05-07T20:25:58.2671334Z 2025-05-07T20:25:58.2671499Z  2025-05-07T20:25:58.2671686Z 2025-05-07T20:25:58.2671699Z 2025-05-07T20:25:58.2671705Z 2025-05-07T20:25:58.2671710Z 2025-05-07T20:25:58.2671727Z 2025-05-07T20:25:58.2671732Z 2025-05-07T20:25:58.2671902Z  2025-05-07T20:25:58.2672102Z 2025-05-07T20:25:58.2672108Z 2025-05-07T20:25:58.2672114Z 2025-05-07T20:25:58.2672119Z 2025-05-07T20:25:58.2672125Z 2025-05-07T20:25:58.2672131Z 2025-05-07T20:25:58.2672144Z 2025-05-07T20:25:58.2672319Z  2025-05-07T20:25:58.2672535Z 2025-05-07T20:25:58.2672540Z 2025-05-07T20:25:58.2672546Z 2025-05-07T20:25:58.2672552Z 2025-05-07T20:25:58.2672557Z 2025-05-07T20:25:58.2672563Z 2025-05-07T20:25:58.2672577Z 2025-05-07T20:25:58.2672583Z 2025-05-07T20:25:58.2672768Z  2025-05-07T20:25:58.2673005Z 2025-05-07T20:25:58.2673016Z 2025-05-07T20:25:58.2673022Z 2025-05-07T20:25:58.2673028Z 2025-05-07T20:25:58.2673034Z 2025-05-07T20:25:58.2673039Z 2025-05-07T20:25:58.2673052Z 2025-05-07T20:25:58.2673057Z 2025-05-07T20:25:58.2673063Z 2025-05-07T20:25:58.2673266Z  2025-05-07T20:25:58.2673517Z 2025-05-07T20:25:58.2673523Z 2025-05-07T20:25:58.2673529Z 
2025-05-07T20:25:58.2673535Z 2025-05-07T20:25:58.2673549Z 2025-05-07T20:25:58.2673555Z 2025-05-07T20:25:58.2673561Z 2025-05-07T20:25:58.2673567Z 2025-05-07T20:25:58.2673573Z 2025-05-07T20:25:58.2673578Z 2025-05-07T20:25:58.2673779Z  2025-05-07T20:25:58.2674045Z 2025-05-07T20:25:58.2674050Z 2025-05-07T20:25:58.2674066Z 2025-05-07T20:25:58.2674071Z 2025-05-07T20:25:58.2674076Z 2025-05-07T20:25:58.2674081Z 2025-05-07T20:25:58.2674086Z 2025-05-07T20:25:58.2674091Z 2025-05-07T20:25:58.2674096Z 2025-05-07T20:25:58.2674102Z 2025-05-07T20:25:58.2674107Z 2025-05-07T20:25:58.2674302Z  2025-05-07T20:25:58.2674578Z 2025-05-07T20:25:58.2674584Z 2025-05-07T20:25:58.2674589Z 2025-05-07T20:25:58.2674594Z 2025-05-07T20:25:58.2674600Z 2025-05-07T20:25:58.2674606Z 2025-05-07T20:25:58.2674611Z 2025-05-07T20:25:58.2674620Z 2025-05-07T20:25:58.2674625Z 2025-05-07T20:25:58.2674631Z 2025-05-07T20:25:58.2674636Z 2025-05-07T20:25:58.2674641Z 2025-05-07T20:25:58.2674827Z  2025-05-07T20:25:58.2675107Z 2025-05-07T20:25:58.2675113Z 2025-05-07T20:25:58.2675118Z 2025-05-07T20:25:58.2675123Z 2025-05-07T20:25:58.2675128Z 2025-05-07T20:25:58.2675133Z 2025-05-07T20:25:58.2675139Z 2025-05-07T20:25:58.2675144Z 2025-05-07T20:25:58.2675149Z 2025-05-07T20:25:58.2675154Z 2025-05-07T20:25:58.2675159Z 2025-05-07T20:25:58.2675164Z 2025-05-07T20:25:58.2675169Z 2025-05-07T20:25:58.2675389Z  2025-05-07T20:25:58.2675655Z 2025-05-07T20:25:58.2675660Z 2025-05-07T20:25:58.2675666Z 2025-05-07T20:25:58.2675671Z 2025-05-07T20:25:58.2675838Z 2025-05-07T20:25:58.2675843Z 2025-05-07T20:25:58.2675848Z 2025-05-07T20:25:58.2675853Z 2025-05-07T20:25:58.2675858Z 2025-05-07T20:25:58.2675863Z 2025-05-07T20:25:58.2675868Z 2025-05-07T20:25:58.2675873Z 2025-05-07T20:25:58.2675967Z 2025-05-07T20:25:58.2675973Z 2025-05-07T20:25:58.2676181Z  2025-05-07T20:25:58.2676452Z 2025-05-07T20:25:58.2676457Z 2025-05-07T20:25:58.2676462Z 2025-05-07T20:25:58.2676467Z 2025-05-07T20:25:58.2676472Z 2025-05-07T20:25:58.2676488Z 2025-05-07T20:25:58.2676493Z 2025-05-07T20:25:58.2676498Z 2025-05-07T20:25:58.2676503Z 2025-05-07T20:25:58.2676508Z 2025-05-07T20:25:58.2676513Z 2025-05-07T20:25:58.2676518Z 2025-05-07T20:25:58.2676523Z 2025-05-07T20:25:58.2676528Z 2025-05-07T20:25:58.2676533Z 2025-05-07T20:25:58.2676756Z  2025-05-07T20:25:58.2677044Z 2025-05-07T20:25:58.2677050Z 2025-05-07T20:25:58.2677055Z 2025-05-07T20:25:58.2677060Z 2025-05-07T20:25:58.2677074Z 2025-05-07T20:25:58.2677079Z 2025-05-07T20:25:58.2677084Z 2025-05-07T20:25:58.2677089Z 2025-05-07T20:25:58.2677094Z 2025-05-07T20:25:58.2677099Z 2025-05-07T20:25:58.2677104Z 2025-05-07T20:25:58.2677109Z 2025-05-07T20:25:58.2677124Z 2025-05-07T20:25:58.2677129Z 2025-05-07T20:25:58.2677134Z 2025-05-07T20:25:58.2677139Z 2025-05-07T20:25:58.2677374Z  2025-05-07T20:25:58.2677679Z 2025-05-07T20:25:58.2677685Z 2025-05-07T20:25:58.2677690Z 2025-05-07T20:25:58.2677695Z 2025-05-07T20:25:58.2677701Z 2025-05-07T20:25:58.2677706Z 2025-05-07T20:25:58.2677711Z 2025-05-07T20:25:58.2677716Z 2025-05-07T20:25:58.2677721Z 2025-05-07T20:25:58.2677726Z 2025-05-07T20:25:58.2677731Z 2025-05-07T20:25:58.2677736Z 2025-05-07T20:25:58.2677741Z 2025-05-07T20:25:58.2677756Z 2025-05-07T20:25:58.2677761Z 2025-05-07T20:25:58.2677766Z 2025-05-07T20:25:58.2677771Z 2025-05-07T20:25:58.2678006Z  2025-05-07T20:25:58.2678324Z 2025-05-07T20:25:58.2678329Z 2025-05-07T20:25:58.2678334Z 2025-05-07T20:25:58.2678347Z 2025-05-07T20:25:58.2678352Z 2025-05-07T20:25:58.2678357Z 2025-05-07T20:25:58.2678362Z 2025-05-07T20:25:58.2678367Z 2025-05-07T20:25:58.2678379Z 
2025-05-07T20:25:58.2683633Z done
2025-05-07T20:25:58.5885837Z Preparing transaction: done
2025-05-07T20:26:00.1662668Z Verifying transaction: done
2025-05-07T20:26:00.9837837Z Executing transaction: done
2025-05-07T20:26:03.3345211Z [INSTALL] Fixing file placements for CUDA 12.6.3+ ...
2025-05-07T20:26:03.3345624Z [INSTALL] Creating symlinks: libnvToolsExt.so
2025-05-07T20:26:03.3346313Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:03.3360294Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:03.3373334Z [INSTALL] Copying nvtx3 headers ...
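
[NOTE] (editor) Recent CUDA conda packages ship only the versioned libnvToolsExt.so.1, while build tooling may still link against the unversioned name, so the step above recreates the .so linker name in both library locations before the header copies that follow. A minimal sketch of the same fix-up, assuming the env root is held in a hypothetical PREFIX variable:

    # Recreate the unversioned linker name next to each versioned copy of the library.
    # PREFIX is illustrative, e.g. /home/ec2-user/miniconda/envs/build_binary
    for libdir in "$PREFIX/lib" "$PREFIX/targets/x86_64-linux/lib"; do
      if [ -f "$libdir/libnvToolsExt.so.1" ]; then
        ln -sf "$libdir/libnvToolsExt.so.1" "$libdir/libnvToolsExt.so"
      fi
    done
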
2025-05-07T20:26:03.3378531Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/
2025-05-07T20:26:03.5142508Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/
2025-05-07T20:26:03.5161818Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
2025-05-07T20:26:03.5527955Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:05.4336774Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:05.4975668Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:05.9329823Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:05.9840434Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:06.4207050Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
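
[NOTE] (editor) The printenv failure above is expected: LD_LIBRARY_PATH was simply unset before this step, so "appending" reduces to setting it. `conda env config vars set` persists the variable in the env's own metadata, and conda then exports it on every later `conda activate` or `conda run`; the stubs directory it points at holds link-time stand-ins for driver libraries such as libcuda.so and libnvidia-ml.so. A minimal sketch of the persist-then-verify pattern, with an illustrative variable name:

    # Persist a variable into the env, then read it back through a fresh `conda run`.
    conda env config vars set -n build_binary MY_VAR=/some/path   # MY_VAR is hypothetical
    conda run -n build_binary printenv MY_VAR                     # prints /some/path
    conda env config vars list -n build_binary                    # shows all persisted vars
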
2025-05-07T20:26:06.4208466Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:08.8708690Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:10.8982724Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:12.9095576Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:12.9096395Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:14.9486719Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:16.8318103Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:16.8932374Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:26:20.7468721Z /tmp/tmp_mbt3l3p: line 3: clang: command not found
2025-05-07T20:26:20.7469583Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:26:20.8103153Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:26:20.8122381Z total 36
2025-05-07T20:26:20.8122915Z drwxr-xr-x. 2 ec2-user ec2-user   191 May  7 20:26 .
2025-05-07T20:26:20.8123315Z drwxr-xr-x. 5 ec2-user ec2-user    62 May  7 20:24 ..
2025-05-07T20:26:20.8123754Z -rw-r--r--. 2 ec2-user ec2-user  3778 Jun 10  2024 activate-binutils_linux-64.sh
2025-05-07T20:26:20.8124258Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10  2024 activate-gcc_linux-64.sh
2025-05-07T20:26:20.8124761Z -rw-r--r--. 2 ec2-user ec2-user  5190 Jun 10  2024 activate-gxx_linux-64.sh
2025-05-07T20:26:20.8125249Z -rw-r--r--. 2 ec2-user ec2-user   136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:26:20.8125875Z -rw-r--r--. 2 ec2-user ec2-user   872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:26:20.8126485Z -rw-r--r--. 2 ec2-user ec2-user  2932 Nov 20 20:32 ~cuda-nvcc_activate.sh
2025-05-07T20:26:20.8127060Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
2025-05-07T20:26:20.8127815Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:26:20.8148817Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:26:22.7776573Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:26:22.7777278Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:26:23.2118892Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:26:25.0995546Z -allow-unsupported-compiler
2025-05-07T20:26:25.1631364Z [INFO] Printing out all preprocessor defines in nvcc ...
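
[NOTE] (editor) This job requests clang as the host compiler, but the packaged ~cuda-nvcc_activate.sh hook appears to pin nvcc's host compiler through a -ccbin= line pointing at the conda g++; deleting that line and persisting NVCC_PREPEND_FLAGS=-allow-unsupported-compiler lets nvcc accept an unpinned host compiler without tripping its version check. The dump that follows prints every macro predefined by the nvcc preprocessing pipeline, a standard way to confirm which toolchain and CUDA API versions the build will actually see. A minimal sketch of the same probe, assuming nvcc is on PATH:

    # Dump all predefined macros for an empty CUDA translation unit and
    # pick out a few toolchain markers from the (very long) output.
    nvcc --compiler-options -dM -E -x cu - < /dev/null \
      | grep -E '__CUDACC_VER_(MAJOR|MINOR)__|__CUDA_API_VER_MAJOR__'
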
2025-05-07T20:26:25.1632698Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null 2025-05-07T20:26:25.1633379Z 2025-05-07T20:26:27.1189895Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead"))) 2025-05-07T20:26:27.1190688Z #define M_PIl 3.141592653589793238462643383279502884L 2025-05-07T20:26:27.1191127Z #define _IO_CURRENTLY_PUTTING 0x800 2025-05-07T20:26:27.1191515Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig)) 2025-05-07T20:26:27.1191849Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:26:27.1192107Z #define _STL_PAIR_H 1 2025-05-07T20:26:27.1192386Z #define __cpp_attributes 200809L 2025-05-07T20:26:27.1192712Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:26:27.1193054Z #define __DELETE_THROW throw() 2025-05-07T20:26:27.1193317Z #define _PTRDIFF_T_ 2025-05-07T20:26:27.1193557Z #define M_PI_4 0.78539816339744830962 2025-05-07T20:26:27.1193843Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:26:27.1194109Z #define _IO_LEFT 02 2025-05-07T20:26:27.1194334Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:26:27.1194590Z #define _POSIX2_BC_SCALE_MAX 99 2025-05-07T20:26:27.1194857Z #define _GLIBCXX_USE_RANDOM_TR1 1 2025-05-07T20:26:27.1195278Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp) 2025-05-07T20:26:27.1195706Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:26:27.1196066Z #define RE_DUP_MAX (0x7fff) 2025-05-07T20:26:27.1196409Z #define _IOS_OUTPUT 2 2025-05-07T20:26:27.1196891Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:26:27.1197285Z #define toascii_l(c,l) __toascii_l ((c), (l)) 2025-05-07T20:26:27.1197596Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:26:27.1197870Z #define _GLIBCXX_USE_FCHMOD 1 2025-05-07T20:26:27.1198144Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:26:27.1198918Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; })) 2025-05-07T20:26:27.1200130Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:26:27.1200555Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:26:27.1200959Z #define cudaTextureTypeCubemapLayered 0xFC 2025-05-07T20:26:27.1201290Z #define _T_WCHAR_ 2025-05-07T20:26:27.1201522Z #define stdout stdout 2025-05-07T20:26:27.1201863Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11"))) 2025-05-07T20:26:27.1202254Z #define CHAR_BIT __CHAR_BIT__ 2025-05-07T20:26:27.1202512Z #define __flexarr [] 2025-05-07T20:26:27.1202760Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:26:27.1203098Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:26:27.1203445Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:26:27.1203718Z #define _MATH_H 1 2025-05-07T20:26:27.1204009Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:26:27.1204350Z #define __S64_TYPE long int 2025-05-07T20:26:27.1204614Z #define __stub_fchflags 2025-05-07T20:26:27.1204886Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:26:27.1205182Z #define __SQUAD_TYPE long int 2025-05-07T20:26:27.1205549Z #define __INTMAX_C(c) c ## L 2025-05-07T20:26:27.1206169Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:26:27.1206548Z #define NL_NMAX INT_MAX 2025-05-07T20:26:27.1206878Z #define _BITS_TIME_H 1 2025-05-07T20:26:27.1207249Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:26:27.1207821Z #define 
_GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:26:27.1208242Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:26:27.1208728Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:26:27.1209707Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:26:27.1210210Z #define __CHAR_BIT__ 8 2025-05-07T20:26:27.1210745Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:27.1211201Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:26:27.1211608Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:26:27.1211882Z #define FP_NAN 0 2025-05-07T20:26:27.1212152Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:26:27.1212595Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:26:27.1213089Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:26:27.1213481Z #define __cudaCDP2GetErrorString 2025-05-07T20:26:27.1213774Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:26:27.1214039Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:26:27.1214302Z #define __SM_80_RT_H__ 2025-05-07T20:26:27.1214537Z #define _NEW 2025-05-07T20:26:27.1214769Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:26:27.1215065Z #define __UINT8_MAX__ 0xff 2025-05-07T20:26:27.1215445Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:26:27.1215856Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:26:27.1216108Z #define __USE_ANSI 1 2025-05-07T20:26:27.1216403Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:26:27.1216805Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:26:27.1217169Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:26:27.1217479Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:26:27.1217768Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:26:27.1218052Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:26:27.1218333Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:26:27.1218621Z #define PIPE_BUF 4096 2025-05-07T20:26:27.1218945Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:26:27.1219309Z #define ADJ_TICK 0x4000 2025-05-07T20:26:27.1219598Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:26:27.1219915Z #define MQ_PRIO_MAX 32768 2025-05-07T20:26:27.1220188Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:26:27.1220523Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:26:27.1220993Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:27.1221525Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:26:27.1221895Z #define _XOPEN_SOURCE 700 2025-05-07T20:26:27.1222157Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:26:27.1222431Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:26:27.1222721Z #define __cpp_static_assert 201411L 2025-05-07T20:26:27.1223065Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:26:27.1223410Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:26:27.1223692Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:26:27.1223983Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:26:27.1224287Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:26:27.1224577Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:26:27.1224885Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:27.1225250Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:26:27.1225595Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:26:27.1225885Z #define 
_GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:26:27.1226210Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:27.1226571Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:26:27.1226937Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:26:27.1227238Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:26:27.1227535Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:26:27.1227871Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:26:27.1228205Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:26:27.1228611Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:26:27.1229037Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:26:27.1229467Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:26:27.1229744Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:26:27.1230030Z #define __GCC_IEC_559 2 2025-05-07T20:26:27.1230413Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:26:27.1230760Z #define _IO_flockfile(_fp) 2025-05-07T20:26:27.1231022Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:26:27.1231304Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:26:27.1231578Z #define _IOFBF 0 2025-05-07T20:26:27.1231792Z #define __USE_BSD 1 2025-05-07T20:26:27.1232030Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:26:27.1232309Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:26:27.1232583Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:26:27.1232845Z #define _IO_NO_WRITES 8 2025-05-07T20:26:27.1233112Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:26:27.1233461Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:26:27.1233823Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:26:27.1234143Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:26:27.1234474Z #define __cpp_binary_literals 201304L 2025-05-07T20:26:27.1234767Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:26:27.1235048Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:26:27.1235327Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:26:27.1235642Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:26:27.1236038Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:26:27.1236413Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:26:27.1236726Z #define M_PI 3.14159265358979323846 2025-05-07T20:26:27.1237044Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:26:27.1237382Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:26:27.1237692Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:26:27.1238000Z #define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:26:27.1238284Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:26:27.1238561Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:26:27.1239152Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? 
EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:26:27.1239806Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:26:27.1240143Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:26:27.1240473Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:26:27.1240784Z #define __cudaCDP2GetErrorName 2025-05-07T20:26:27.1241068Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:26:27.1241338Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:26:27.1241652Z #define __ASSERT_VOID_CAST static_cast 2025-05-07T20:26:27.1241993Z #define __cpp_variadic_templates 200704L 2025-05-07T20:26:27.1242299Z #define RAND_MAX 2147483647 2025-05-07T20:26:27.1242571Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:26:27.1242908Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:27.1243229Z #define __SM_90_RT_H__ 2025-05-07T20:26:27.1243475Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:26:27.1243748Z #define __COMPAR_FN_T 2025-05-07T20:26:27.1243994Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:26:27.1244257Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:26:27.1244738Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:26:27.1245257Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:26:27.1245604Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:26:27.1245966Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:26:27.1246286Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:26:27.1246632Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:26:27.1258216Z #define __cpp_variable_templates 201304L 2025-05-07T20:26:27.1258762Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:27.1259326Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:26:27.1259670Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:26:27.1260165Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:26:27.1260476Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:26:27.1260782Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:26:27.1261158Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:26:27.1261439Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:26:27.1261705Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:26:27.1261962Z #define __u_char_defined 2025-05-07T20:26:27.1262293Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:26:27.1262657Z #define STA_PPSERROR 0x0800 2025-05-07T20:26:27.1262927Z #define _GLIBCXX_STD_A std 2025-05-07T20:26:27.1263191Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:26:27.1263483Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:26:27.1263928Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:26:27.1264365Z #define FP_INFINITE 1 2025-05-07T20:26:27.1264743Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:27.1265172Z #define _IO_pid_t __pid_t 2025-05-07T20:26:27.1265439Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:26:27.1265713Z #define __LEAF , __leaf__ 2025-05-07T20:26:27.1265964Z #define PATH_MAX 4096 2025-05-07T20:26:27.1266235Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:26:27.1266586Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:26:27.1266913Z #define _LIMITS_H___ 2025-05-07T20:26:27.1267152Z #define __size_t 2025-05-07T20:26:27.1267394Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:26:27.1267957Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | 
STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:26:27.1268526Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:26:27.1268846Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:26:27.1269190Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:26:27.1269454Z #define _WCHAR_T_DEFINED 2025-05-07T20:26:27.1269826Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:26:27.1270241Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:26:27.1270541Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:26:27.1270875Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:26:27.1271174Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:26:27.1271470Z #define __INT8_C(c) c 2025-05-07T20:26:27.1271738Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:26:27.1272051Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:26:27.1272327Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:26:27.1272587Z #define __SM_70_RT_HPP__ 2025-05-07T20:26:27.1272844Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:26:27.1273125Z #define __cpp_variadic_using 201611L 2025-05-07T20:26:27.1273447Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:27.1273787Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:26:27.1274072Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:26:27.1274344Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:26:27.1274616Z #define __cpp_capture_star_this 201603L 2025-05-07T20:26:27.1274943Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:26:27.1275255Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:26:27.1275618Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:26:27.1276014Z #define NFDBITS __NFDBITS 2025-05-07T20:26:27.1276283Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:26:27.1276575Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:26:27.1276902Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:26:27.1277227Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:26:27.1277486Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:26:27.1277783Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:26:27.1278094Z #define STA_UNSYNC 0x0040 2025-05-07T20:26:27.1278406Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:27.1278837Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:26:27.1279209Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:26:27.1279507Z #define __cpp_if_constexpr 201606L 2025-05-07T20:26:27.1279957Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:26:27.1280337Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:26:27.1280688Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:26:27.1281132Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:26:27.1281478Z #define __daddr_t_defined 2025-05-07T20:26:27.1281740Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:26:27.1282020Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:26:27.1282349Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:26:27.1282873Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:26:27.1283370Z #define _ACRTIMP 2025-05-07T20:26:27.1283600Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:26:27.1283877Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:26:27.1284177Z #define _IOS_BIN 128 2025-05-07T20:26:27.1284535Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:26:27.1284965Z #define __FLT64X_HAS_QUIET_NAN__ 1 
2025-05-07T20:26:27.1285242Z #define UNDERFLOW 4 2025-05-07T20:26:27.1285464Z #define NAME_MAX 255 2025-05-07T20:26:27.1285712Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:26:27.1285989Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:26:27.1286271Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:26:27.1286576Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:26:27.1286960Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:26:27.1287350Z #define __ptr_t void * 2025-05-07T20:26:27.1287695Z #define M_E 2.7182818284590452354 2025-05-07T20:26:27.1287979Z #define cudaSurfaceType1D 0x01 2025-05-07T20:26:27.1288251Z #define __USE_ISOCXX11 1 2025-05-07T20:26:27.1288521Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:26:27.1288839Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:26:27.1289138Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:26:27.1289409Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:26:27.1289708Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:26:27.1290024Z #define cudaSurfaceType2D 0x02 2025-05-07T20:26:27.1290281Z #define __linux 1 2025-05-07T20:26:27.1290517Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:26:27.1290793Z #define cudaDeviceMask 0xff 2025-05-07T20:26:27.1291057Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:26:27.1291354Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:26:27.1291637Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:26:27.1291920Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:26:27.1292234Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:26:27.1292543Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:26:27.1292836Z #define _BITS_TYPES_H 1 2025-05-07T20:26:27.1293124Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:26:27.1293460Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:26:27.1293759Z #define cudaSurfaceType3D 0x03 2025-05-07T20:26:27.1294034Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:26:27.1294329Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:26:27.1294623Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:26:27.1295415Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:26:27.1296246Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:26:27.1296539Z #define __unix 1 2025-05-07T20:26:27.1296759Z #define MATH_ERRNO 1 2025-05-07T20:26:27.1297001Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:26:27.1297282Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:26:27.1297556Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:26:27.1297835Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:26:27.1298120Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:26:27.1298410Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:26:27.1298868Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:26:27.1299441Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:26:27.1299742Z #define CUDARTAPI_CDECL 2025-05-07T20:26:27.1299993Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:26:27.1300348Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:26:27.1300637Z #define __cpp_lib_void_t 201411 2025-05-07T20:26:27.1300899Z #define _POSIX_AIO_MAX 1 2025-05-07T20:26:27.1301131Z #define __SIZE_T 2025-05-07T20:26:27.1301384Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:26:27.1301707Z #define _GLIBCXX_FULLY_DYNAMIC_STRING 
0 2025-05-07T20:26:27.1301999Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:26:27.1302261Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:26:27.1302520Z #define _ATFILE_SOURCE 1 2025-05-07T20:26:27.1302905Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:26:27.1303334Z #define __WAIT_STATUS void * 2025-05-07T20:26:27.1303596Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:26:27.1303858Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:26:27.1304132Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:26:27.1304417Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:26:27.1304686Z #define __WINT_MIN__ 0U 2025-05-07T20:26:27.1305282Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:26:27.1306354Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:26:27.1306655Z #define WUNTRACED 2 2025-05-07T20:26:27.1306882Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:26:27.1307159Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:26:27.1307443Z #define NZERO 20 2025-05-07T20:26:27.1307667Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:26:27.1307954Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:26:27.1308256Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:26:27.1308546Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:26:27.1308808Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:26:27.1309105Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:26:27.1309432Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:26:27.1309729Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:26:27.1310008Z #define EXIT_FAILURE 1 2025-05-07T20:26:27.1310258Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:26:27.1310524Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:26:27.1310798Z #define _SIZE_T_DEFINED_ 2025-05-07T20:26:27.1311057Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:26:27.1311338Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:26:27.1311685Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:26:27.1312052Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:26:27.1312347Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:26:27.1312611Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:26:27.1312891Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:26:27.1313190Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:26:27.1313504Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:26:27.1313804Z #define SEEK_DATA 3 2025-05-07T20:26:27.1314043Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:26:27.1314345Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:26:27.1314782Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:26:27.1315182Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:26:27.1315433Z #define __INT64_C(c) c ## L 2025-05-07T20:26:27.1315727Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:26:27.1316058Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:26:27.1316388Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:26:27.1316669Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:26:27.1316970Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:26:27.1317287Z #define STA_PPSWANDER 0x0400 2025-05-07T20:26:27.1317551Z #define __INT_WCHAR_T_H 2025-05-07T20:26:27.1317792Z #define WSTOPPED 2 2025-05-07T20:26:27.1318040Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:26:27.1318331Z #define _POSIX_MQ_OPEN_MAX 8 2025-05-07T20:26:27.1318579Z #define FP_NORMAL 4 
[predefined-macro dump emitted during the CUDA build (host GCC headers plus NVCC device-pass macros); the lines identifying the toolchain are retained below in log order, the remaining #define lines (glibc, libstdc++, Linux limits, CUDA runtime constants) are elided]
2025-05-07T20:26:27.1328945Z #define __GLIBC__ 2
2025-05-07T20:26:27.1364120Z #define __CUDACC_VER_MINOR__ 6
2025-05-07T20:26:27.1369397Z #define __gnu_linux__ 1
2025-05-07T20:26:27.1393299Z #define __GNUC__ 11
2025-05-07T20:26:27.1435432Z #define __cplusplus 201703L
2025-05-07T20:26:27.1455623Z #define __GLIBCXX__ 20230528
2025-05-07T20:26:27.1503851Z #define __CUDACC_VER_BUILD__ 85
2025-05-07T20:26:27.1530471Z #define __VERSION__ "11.4.0"
2025-05-07T20:26:27.1569268Z #define __CUDACC_VER_MAJOR__ 12
2025-05-07T20:26:27.1570090Z #define __CUDA_ARCH__ 520
2025-05-07T20:26:27.1582034Z #define __x86_64__ 1
[... dump continues ...]
2025-05-07T20:26:27.1605387Z #define assert(expr) ((expr) ?
__ASSERT_VOID_CAST (0) : __assert_fail (__STRING(expr), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:27.1605486Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:26:27.1605588Z #define le64toh(x) (x) 2025-05-07T20:26:27.1606585Z #define FILENAME_MAX 4096 2025-05-07T20:26:27.1606754Z #define __iscntrl_l(c,l) __isctype_l((c), _IScntrl, (l)) 2025-05-07T20:26:27.1606891Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:26:27.1606983Z #define L_cuserid 9 2025-05-07T20:26:27.1607077Z #define __ino_t_defined 2025-05-07T20:26:27.1607171Z #define __k8__ 1 2025-05-07T20:26:27.1607275Z #define __INTPTR_TYPE__ long int 2025-05-07T20:26:27.1607390Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:26:27.1607489Z #define __int8_t_defined 2025-05-07T20:26:27.1607669Z #define __WCHAR_TYPE__ int 2025-05-07T20:26:27.1607781Z #define __CLOCKID_T_TYPE __S32_TYPE 2025-05-07T20:26:27.1607901Z #define cudaHostRegisterPortable 0x01 2025-05-07T20:26:27.1608006Z #define __SLONGWORD_TYPE long int 2025-05-07T20:26:27.1608103Z #define _IOS_TRUNC 16 2025-05-07T20:26:27.1608227Z #define _GLIBCXX_PACKAGE_TARNAME "libstdc++" 2025-05-07T20:26:27.1608383Z #define __isblank_l(c,l) __isctype_l((c), _ISblank, (l)) 2025-05-07T20:26:27.1608488Z #define __HAVE_COLUMN 2025-05-07T20:26:27.1608581Z #define __stub_fdetach 2025-05-07T20:26:27.1608997Z #define __CUDACC_VER__ "__CUDACC_VER__ is no longer supported. Use __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, and __CUDACC_VER_BUILD__ instead." 2025-05-07T20:26:27.1609097Z #define __pic__ 2 2025-05-07T20:26:27.1609224Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:27.1609334Z #define CLOCKS_PER_SEC 1000000l 2025-05-07T20:26:27.1609434Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:26:27.1609541Z #define _GLIBCXX_HAVE_SOCKATMARK 1 2025-05-07T20:26:27.1609645Z #define __stub_chflags 2025-05-07T20:26:27.1609742Z #define CLOCK_BOOTTIME 7 2025-05-07T20:26:27.1609834Z #define __need_IOV_MAX 2025-05-07T20:26:27.1609956Z #define putc(_ch,_fp) _IO_putc (_ch, _fp) 2025-05-07T20:26:27.1610066Z #define __UQUAD_TYPE unsigned long int 2025-05-07T20:26:27.1610171Z #define __cpp_decltype 200707L 2025-05-07T20:26:27.1610281Z #define __BYTE_ORDER __LITTLE_ENDIAN 2025-05-07T20:26:27.1610383Z #define _GLIBCXX_USE_C99 1 2025-05-07T20:26:27.1610495Z #define _GLIBCXX_TR1_BETA_FUNCTION_TCC 1 2025-05-07T20:26:27.1610595Z #define TTY_NAME_MAX 32 2025-05-07T20:26:27.1610775Z #define _GLIBCXX_FORWARD(_Tp,__val) std::forward<_Tp>(__val) 2025-05-07T20:26:27.1610911Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:27.1611085Z #define _PSTL_ASSERT(_Condition) __glibcxx_assert(_Condition) 2025-05-07T20:26:27.1611204Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:26:27.1611311Z #define __LITTLE_ENDIAN 1234 2025-05-07T20:26:27.1611412Z #define STA_PPSTIME 0x0004 2025-05-07T20:26:27.1611503Z #define __import__ 2025-05-07T20:26:27.1611607Z #define BUFSIZ _IO_BUFSIZ 2025-05-07T20:26:27.1611750Z #define M_SQRT2l 1.414213562373095048801688724209698079L 2025-05-07T20:26:27.1611841Z #define __export__ 2025-05-07T20:26:27.1611976Z #define __FSID_T_TYPE struct { int __val[2]; } 2025-05-07T20:26:27.1612084Z #define cudaMemAttachHost 0x02 2025-05-07T20:26:27.1612493Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:26:27.1612604Z #define _GLIBCXX_HAVE_ICONV 1 2025-05-07T20:26:27.1612700Z #define _GLIBCXX_SYMVER 1 2025-05-07T20:26:27.1612925Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:26:27.1613026Z #define _WCHAR_T_DECLARED 
2025-05-07T20:26:27.1613153Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:26:27.1613284Z #define isalpha_l(c,l) __isalpha_l ((c), (l)) 2025-05-07T20:26:27.1613399Z #define __cpp_inline_variables 201606L 2025-05-07T20:26:27.1613497Z #define WNOWAIT 0x01000000 2025-05-07T20:26:27.1613594Z #define PLOSS 6 2025-05-07T20:26:27.1613697Z #define M_LN10 2.30258509299404568402 2025-05-07T20:26:27.1613964Z #define _PSTL_UDS_PRESENT (__INTEL_COMPILER >= 1900 && __INTEL_COMPILER_BUILD_DATE >= 20180626) 2025-05-07T20:26:27.1614065Z #define EXIT_SUCCESS 0 2025-05-07T20:26:27.1614169Z #define __LDBL_REDIR_DECL(name) 2025-05-07T20:26:27.1614281Z #define _GLIBCXX_HAVE_STRTOF 1 2025-05-07T20:26:27.1614396Z #define MOD_FREQUENCY ADJ_FREQUENCY 2025-05-07T20:26:27.1614495Z #define __thread__ __thread 2025-05-07T20:26:27.1614605Z #define _GLIBCXX_HAVE_MEMORY_H 1 2025-05-07T20:26:27.1614710Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:26:27.1614821Z #define __SIZEOF_PTHREAD_BARRIER_T 32 2025-05-07T20:26:27.1615062Z #define __glibcxx_requires_partitioned_upper_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:26:27.1615186Z #define __cudaCDP2StreamWaitEvent_ptsz 2025-05-07T20:26:27.1615289Z #define _GLIBCXX_HAVE_SINF 1 2025-05-07T20:26:27.1615385Z #define __linux__ 1 2025-05-07T20:26:27.1615488Z #define STA_PPSSIGNAL 0x0100 2025-05-07T20:26:27.1615623Z #define M_LN2l 0.693147180559945309417232121458176568L 2025-05-07T20:26:27.1615729Z #define __S16_TYPE short int 2025-05-07T20:26:27.1616081Z #define __glibcxx_constexpr_assert(cond) if (__builtin_is_constant_evaluated() && !bool(cond)) __builtin_unreachable() 2025-05-07T20:26:27.1616203Z #define __NVCC_DIAG_PRAGMA_SUPPORT__ 1 2025-05-07T20:26:27.1616404Z #define __bos(ptr) __builtin_object_size (ptr, __USE_FORTIFY_LEVEL > 1) 2025-05-07T20:26:27.1616509Z #define __COMMON_FUNCTIONS_H__ 2025-05-07T20:26:27.1616621Z #define UINT_MAX (INT_MAX * 2U + 1U) 2025-05-07T20:26:27.1616716Z #define _T_SIZE_ 2025-05-07T20:26:27.1616821Z #define LLONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:27.1616953Z #define __cudaCDP2StreamCreateWithFlags 2025-05-07T20:26:27.1617054Z #define _PSTL_VERSION 12000 2025-05-07T20:26:27.1617183Z #define __noinline__ __attribute__((noinline)) 2025-05-07T20:26:27.1617292Z #define __WNOTHREAD 0x20000000 2025-05-07T20:26:27.1617397Z #define _G_va_list __gnuc_va_list 2025-05-07T20:26:27.1617540Z #define M_PI_4l 0.785398163397448309615660845819875721L 2025-05-07T20:26:27.1617633Z #define _IOS_INPUT 1 2025-05-07T20:26:27.1617732Z #define __USE_LARGEFILE64 1 2025-05-07T20:26:27.1617850Z #define _GLIBCXX_TR1_EXP_INTEGRAL_TCC 1 2025-05-07T20:26:27.1617951Z #define __INT64_TYPE__ long int 2025-05-07T20:26:27.1618053Z #define _POSIX_SSIZE_MAX 32767 2025-05-07T20:26:27.1618170Z #define __shared__ __location__(shared) 2025-05-07T20:26:27.1618269Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:26:27.1618430Z #define __glibc_unlikely(cond) __builtin_expect((cond), 0) 2025-05-07T20:26:27.1618537Z #define __gid_t_defined 2025-05-07T20:26:27.1618656Z #define _GLIBCXX_USE_SC_NPROCESSORS_ONLN 1 2025-05-07T20:26:27.1618766Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:26:27.1618971Z #define __glibcxx_requires_can_increment_range(_First1,_Last1,_First2) 2025-05-07T20:26:27.1619077Z #define _GLIBCXX17_INLINE inline 2025-05-07T20:26:27.1619180Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:26:27.1619275Z #define ___int_size_t_h 2025-05-07T20:26:27.1619390Z #define __FSBLKCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:27.1619550Z #define __cpp_inheriting_constructors 
201511L 2025-05-07T20:26:27.1619740Z #define __WIFCONTINUED(status) ((status) == __W_CONTINUED) 2025-05-07T20:26:27.1619849Z #define CUDA_DOUBLE_MATH_FUNCTIONS 1 2025-05-07T20:26:27.1619957Z #define _GLIBCXX_HAVE_FENV_H 1 2025-05-07T20:26:27.1620149Z #define _GLIBCXX_HAVE_STDBOOL_H 1 2025-05-07T20:26:27.1620250Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:26:27.1620387Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:27.1620578Z #define _GLIBCXX_TR1_HYPERGEOMETRIC_TCC 1 2025-05-07T20:26:27.1620717Z #define _GLIBCXX_DEBUG_PEDASSERT(_Condition) 2025-05-07T20:26:27.1620815Z #define __clock_t_defined 1 2025-05-07T20:26:27.1620922Z #define _POSIX_SEM_VALUE_MAX 32767 2025-05-07T20:26:27.1621045Z #define __cudaCDP2RuntimeGetVersion 2025-05-07T20:26:27.1621143Z #define __GLIBC_MINOR__ 17 2025-05-07T20:26:27.1621243Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:26:27.1621354Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:26:27.1621470Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:26:27.1621569Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:26:27.1621753Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:27.1621843Z #define __SSE__ 1 2025-05-07T20:26:27.1621961Z #define SEM_VALUE_MAX (2147483647) 2025-05-07T20:26:27.1622068Z #define M_SQRT1_2 0.70710678118654752440 2025-05-07T20:26:27.1622158Z #define _CTYPE_H 1 2025-05-07T20:26:27.1622261Z #define __sigset_t_defined 2025-05-07T20:26:27.1622368Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:26:27.1622470Z #define _GLIBCXX_HAVE_LOGF 1 2025-05-07T20:26:27.1622569Z #define MOD_TAI ADJ_TAI 2025-05-07T20:26:27.1622673Z #define _IO_va_list __gnuc_va_list 2025-05-07T20:26:27.1622774Z #define _GLIBCXX_HAVE_LOGL 1 2025-05-07T20:26:27.1622870Z #define __SM_70_RT_H__ 2025-05-07T20:26:27.1622972Z #define _GLIBCXX_HAVE_WRITEV 1 2025-05-07T20:26:27.1623085Z #define cudaEventWaitDefault 0x00 2025-05-07T20:26:27.1623192Z #define _GLIBCXX_HAVE_EXPL 1 2025-05-07T20:26:27.1623359Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:27.1623467Z #define _POSIX_MAX_CANON 255 2025-05-07T20:26:27.1623583Z #define _GLIBCXX_NOEXCEPT_PARM , bool _NE 2025-05-07T20:26:27.1623685Z #define FD_SETSIZE __FD_SETSIZE 2025-05-07T20:26:27.1623792Z #define _GLIBCXX_TXN_SAFE 2025-05-07T20:26:27.1623881Z #define __amd64__ 1 2025-05-07T20:26:27.1623976Z #define __WINT_WIDTH__ 32 2025-05-07T20:26:27.1624096Z #define __CUDA_DEVICE_RUNTIME_API_H__ 2025-05-07T20:26:27.1624367Z #define __REDIRECT_NTHNL(name,proto,alias) name proto __THROWNL __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:27.1624474Z #define _GLIBCXX_STDIO_SEEK_CUR 1 2025-05-07T20:26:27.1624583Z #define EOF (-1) 2025-05-07T20:26:27.1624687Z #define __WAIT_STATUS_DEFN void * 2025-05-07T20:26:27.1624786Z #define __USE_POSIX199309 1 2025-05-07T20:26:27.1624897Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:26:27.1624997Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:26:27.1625106Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:26:27.1625211Z #define LLONG_MIN (-LLONG_MAX-1) 2025-05-07T20:26:27.1625331Z #define cudaSurfaceType2DLayered 0xF2 2025-05-07T20:26:27.1625441Z #define ____mbstate_t_defined 1 2025-05-07T20:26:27.1625537Z #define STA_NANO 0x2000 2025-05-07T20:26:27.1625647Z #define _GLIBCXX_HAVE_LOG10F 1 2025-05-07T20:26:27.1625757Z #define _GLIBCXX_HAVE_LOG10L 1 2025-05-07T20:26:27.1625851Z #define _IO_LINKED 0x80 2025-05-07T20:26:27.1625958Z #define __cpp_lib_launder 201606 2025-05-07T20:26:27.1626065Z #define 
__SIZEOF_INT128__ 16 2025-05-07T20:26:27.1626173Z #define __PTHREAD_MUTEX_HAVE_PREV 1 2025-05-07T20:26:27.1626276Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:26:27.1626385Z #define _GLIBCXX_TYPE_TRAITS 1 2025-05-07T20:26:27.1626535Z #define cudaGraphKernelNodePortProgrammatic 1 2025-05-07T20:26:27.1626656Z #define __DEVICE_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:27.1626765Z #define __BLKCNT64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:27.1626867Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:26:27.1626975Z #define __W_CONTINUED 0xffff 2025-05-07T20:26:27.1627075Z #define __ATOMIC_RELAXED 0 2025-05-07T20:26:27.1627214Z #define w_coredump __wait_terminated.__w_coredump 2025-05-07T20:26:27.1627351Z #define __FSBLKCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:27.1627646Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessor 2025-05-07T20:26:27.1627839Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:26:27.1628044Z #define __stub_stty 2025-05-07T20:26:27.1628225Z #define _tolower(c) ((int) (*__ctype_tolower_loc ())[(int) (c)]) 2025-05-07T20:26:27.1628326Z #define le16toh(x) (x) 2025-05-07T20:26:27.1628439Z #define BC_SCALE_MAX _POSIX2_BC_SCALE_MAX 2025-05-07T20:26:27.1641361Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:26:27.1641498Z #define _SIZET_ 2025-05-07T20:26:27.1641602Z #define XATTR_NAME_MAX 255 2025-05-07T20:26:27.1641694Z #define _SVID_SOURCE 1 2025-05-07T20:26:27.1641781Z #define _LP64 1 2025-05-07T20:26:27.1641879Z #define _LIBC_LIMITS_H_ 1 2025-05-07T20:26:27.1642132Z #define __REDIRECT_NTH_LDBL(name,proto,alias) __REDIRECT_NTH (name, proto, alias) 2025-05-07T20:26:27.1642249Z #define _GLIBCXX_TR1_BESSEL_FUNCTION_TCC 1 2025-05-07T20:26:27.1642356Z #define __UINT8_C(c) c 2025-05-07T20:26:27.1642458Z #define _GLIBCXX_HAVE_CEILF 1 2025-05-07T20:26:27.1642558Z #define _GLIBCXX_HAVE_CEILL 1 2025-05-07T20:26:27.1642672Z #define __cudaCDP2Memset3DAsync_ptsz 2025-05-07T20:26:27.1642775Z #define __CUDA_ARCH_LIST__ 520 2025-05-07T20:26:27.1642878Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:26:27.1642981Z #define MOD_MAXERROR ADJ_MAXERROR 2025-05-07T20:26:27.1643070Z #define CUDARTAPI 2025-05-07T20:26:27.1643165Z #define IOV_MAX 1024 2025-05-07T20:26:27.1643315Z #define __glibcxx_requires_irreflexive2(_First,_Last) 2025-05-07T20:26:27.1643418Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:26:27.1643529Z #define cudaMemAttachSingle 0x04 2025-05-07T20:26:27.1643616Z #define __wchar_t__ 2025-05-07T20:26:27.1643723Z #define __cpp_lib_is_aggregate 201703 2025-05-07T20:26:27.1643813Z #define SEEK_END 2 2025-05-07T20:26:27.1643910Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:26:27.1644092Z #define _GLIBCXX_USE_TBB_PAR_BACKEND __has_include() 2025-05-07T20:26:27.1644198Z #define _IO_ftrylockfile(_fp) 2025-05-07T20:26:27.1644345Z #define _GLIBCXX_USE_C99_WCHAR _GLIBCXX11_USE_C99_WCHAR 2025-05-07T20:26:27.1644444Z #define ____FILE_defined 1 2025-05-07T20:26:27.1644568Z #define _GLIBCXX_HAVE_BUILTIN_IS_AGGREGATE 1 2025-05-07T20:26:27.1644667Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:26:27.1644763Z #define _ISOC99_SOURCE 1 2025-05-07T20:26:27.1644863Z #define __VECTOR_FUNCTIONS_H__ 2025-05-07T20:26:27.1645116Z #define __REDIRECT_NTH(name,proto,alias) name proto __THROW __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:27.1645258Z #define _PSTL_USE_NONTEMPORAL_STORES_IF_ALLOWED 2025-05-07T20:26:27.1645347Z #define _IO_RIGHT 04 2025-05-07T20:26:27.1645451Z #define __END_NAMESPACE_STD 2025-05-07T20:26:27.1645643Z 
#define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:27.1645740Z #define _GLIBCXX_STD_C std 2025-05-07T20:26:27.1645868Z #define cudaInitDeviceFlagsAreValid 0x01 2025-05-07T20:26:27.1645967Z #define _LARGEFILE64_SOURCE 1 2025-05-07T20:26:27.1646075Z #define _GLIBCXX_USE_C99_STDINT_TR1 1 2025-05-07T20:26:27.1646167Z #define _STDDEF_H_ 2025-05-07T20:26:27.1646344Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:27.1646451Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:26:27.1646578Z #define isalnum_l(c,l) __isalnum_l ((c), (l)) 2025-05-07T20:26:27.1646784Z #define __FD_ISSET(d,set) ((__FDS_BITS (set)[__FD_ELT (d)] & __FD_MASK (d)) != 0) 2025-05-07T20:26:27.1646907Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:27.1647052Z #define __glibcxx_requires_irreflexive(_First,_Last) 2025-05-07T20:26:27.1647179Z #define cudaGraphKernelNodePortDefault 0 2025-05-07T20:26:27.1647291Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:26:27.1647404Z #define __cudaCDP2Memcpy3DAsync_ptsz 2025-05-07T20:26:27.1647503Z #define __PID_T_TYPE __S32_TYPE 2025-05-07T20:26:27.1647700Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:26:27.1647802Z #define CHARCLASS_NAME_MAX 2048 2025-05-07T20:26:27.1648069Z #define _GLIBCXX_HAVE_TANF 1 2025-05-07T20:26:27.1648177Z #define _GLIBCXX_USE_ST_MTIM 1 2025-05-07T20:26:27.1648354Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:26:27.1648542Z #define __CUDA_RUNTIME_H__ 2025-05-07T20:26:27.1648725Z #define WIFSIGNALED(status) __WIFSIGNALED (__WAIT_INT (status)) 2025-05-07T20:26:27.1648827Z #define _GLIBCXX_HAVE_STDLIB_H 1 2025-05-07T20:26:27.1648933Z #define __STDCPP_THREADS__ 1 2025-05-07T20:26:27.1649080Z #define M_2_SQRTPIl 1.128379167095512573896158903121545172L 2025-05-07T20:26:27.1649194Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:26:27.1649310Z #define _POSIX_UIO_MAXIOV 16 2025-05-07T20:26:27.1649437Z #define _PSTL_PAR_BACKEND_SERIAL 2025-05-07T20:26:27.1649538Z #define P_tmpdir "/tmp" 2025-05-07T20:26:27.1649669Z #define __ASSERT_FUNCTION __PRETTY_FUNCTION__ 2025-05-07T20:26:27.1649767Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:26:27.1649870Z #define __WORDSIZE_TIME64_COMPAT32 1 2025-05-07T20:26:27.1650057Z #define _GLIBCXX_DEPRECATED __attribute__ ((__deprecated__)) 2025-05-07T20:26:27.1650229Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:26:27.1650345Z #define _PSTL_HIDE_FROM_ABI_PUSH 2025-05-07T20:26:27.1650471Z #define cudaStreamLegacy ((cudaStream_t)0x1) 2025-05-07T20:26:27.1650587Z #define _IO_cleanup_region_start(_fct,_fp) 2025-05-07T20:26:27.1650697Z #define __location__(a) __annotate__(a) 2025-05-07T20:26:27.1650927Z #define __device_builtin_surface_type__ __location__(device_builtin_surface_type) 2025-05-07T20:26:27.1651029Z #define _POSIX2_BC_BASE_MAX 99 2025-05-07T20:26:27.1651151Z #define __cudaCDP2DeviceGetAttribute 2025-05-07T20:26:27.1651250Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:26:27.1651344Z #define __STDC_UTF_32__ 1 2025-05-07T20:26:27.1651447Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:26:27.1651549Z #define NAN (__builtin_nanf ("")) 2025-05-07T20:26:27.1651654Z #define _POSIX_MQ_PRIO_MAX 32 2025-05-07T20:26:27.1651746Z #define __FXSR__ 1 2025-05-07T20:26:27.1651832Z #define _SIZE_T 2025-05-07T20:26:27.1651944Z #define _GLIBCXX_USE_GETTIMEOFDAY 1 2025-05-07T20:26:27.1652059Z #define cudaHostRegisterReadOnly 0x08 2025-05-07T20:26:27.1652236Z #define 
__FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:26:27.1652394Z #define __WIFSTOPPED(status) (((status) & 0xff) == 0x7f) 2025-05-07T20:26:27.1652491Z #define _IO_ssize_t __ssize_t 2025-05-07T20:26:27.1652595Z #define __ULONG32_TYPE unsigned int 2025-05-07T20:26:27.1652789Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:27.1652991Z #define cudaStreamGraphTailLaunch (cudaStream_t)0x0100000000000000 2025-05-07T20:26:27.1653092Z #define _GXX_NULLPTR_T 2025-05-07T20:26:27.1653218Z #define __glibcxx_class_requires3(_a,_b,_c,_d) 2025-05-07T20:26:27.1653309Z #define FOPEN_MAX 16 2025-05-07T20:26:27.1653409Z #define __BIG_ENDIAN 4321 2025-05-07T20:26:27.1653530Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:27.1653636Z #define __suseconds_t_defined 2025-05-07T20:26:27.1653735Z #define __off_t_defined 2025-05-07T20:26:27.1653824Z #define stderr stderr 2025-05-07T20:26:27.1653931Z #define M_LOG10E 0.43429448190325182765 2025-05-07T20:26:27.1654054Z #define __glibcxx_requires_string(_String) 2025-05-07T20:26:27.1654155Z #define _GLIBCXX_HAVE_LDEXPL 1 2025-05-07T20:26:27.1654249Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:26:27.1654667Z #define _PSTL_CPP14_2RANGE_MISMATCH_EQUAL_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201300L || __cpp_lib_robust_nonmodifying_seq_ops == 201304) 2025-05-07T20:26:27.1654762Z #define __mode_t_defined 2025-05-07T20:26:27.1654859Z #define _GCC_SIZE_T 2025-05-07T20:26:27.1654961Z #define __INO64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:27.1655066Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:26:27.1655183Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:26:27.1655282Z #define __USE_XOPEN2K8XSI 1 2025-05-07T20:26:27.1655377Z #define __UINT32_C(c) c ## U 2025-05-07T20:26:27.1655655Z #define __cpp_alias_templates 200704L 2025-05-07T20:26:27.1655764Z #define cudaHostAllocMapped 0x02 2025-05-07T20:26:27.1655872Z #define __DEVICE_LAUNCH_PARAMETERS_H__ 2025-05-07T20:26:27.1656068Z #define _STL_ITERATOR_H 1 2025-05-07T20:26:27.1656155Z #define __size_t__ 2025-05-07T20:26:27.1656297Z #define cudaStreamAttrID cudaLaunchAttributeID 2025-05-07T20:26:27.1656395Z #define _GLIBCXX_HAVE_ATANF 1 2025-05-07T20:26:27.1656508Z #define cudaEventRecordExternal 0x01 2025-05-07T20:26:27.1656668Z #define __isspace_l(c,l) __isctype_l((c), _ISspace, (l)) 2025-05-07T20:26:27.1656769Z #define _IO_BUFSIZ _G_BUFSIZ 2025-05-07T20:26:27.1656941Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:26:27.1657038Z #define _ENDIAN_H 1 2025-05-07T20:26:27.1657147Z #define __builtin_align__(a) __align__(a) 2025-05-07T20:26:27.1657248Z #define _GLIBCXX20_CONSTEXPR 2025-05-07T20:26:27.1657357Z #define __NV_NO_HOST_COMPILER_CHECK 1 2025-05-07T20:26:27.1657441Z #define __try try 2025-05-07T20:26:27.1657557Z #define _GLIBCXX_HAVE_FINITE 1 2025-05-07T20:26:27.1657654Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:26:27.1657746Z #define __INT8_MAX__ 0x7f 2025-05-07T20:26:27.1658016Z #define cudaStreamGetCaptureInfo __CUDART_API_PTSZ(cudaStreamGetCaptureInfo_v2) 2025-05-07T20:26:27.1658109Z #define __LONG_WIDTH__ 64 2025-05-07T20:26:27.1658195Z #define __PIC__ 2 2025-05-07T20:26:27.1658315Z #define BC_STRING_MAX _POSIX2_BC_STRING_MAX 2025-05-07T20:26:27.1658437Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:26:27.1658571Z #define FD_ISSET(fd,fdsetp) __FD_ISSET (fd, fdsetp) 2025-05-07T20:26:27.1658675Z #define _GLIBCXX_HAVE_FLOAT_H 1 2025-05-07T20:26:27.1658772Z #define 
_GLIBCXX_HAVE_ATANL 1 2025-05-07T20:26:27.1658958Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:26:27.1659066Z #define __DEVICE_FUNCTIONS_HPP__ 2025-05-07T20:26:27.1659171Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:26:27.1659268Z #define _IO_uid_t __uid_t 2025-05-07T20:26:27.1659373Z #define _GLIBCXX_HAVE_READLINK 1 2025-05-07T20:26:27.1659504Z #define __cudaCDP2EventRecordWithFlags_ptsz 2025-05-07T20:26:27.1659607Z #define _CONCEPT_CHECK_H 1 2025-05-07T20:26:27.1659763Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:26:27.1659868Z #define _GLIBCXX_HAVE_NETINET_IN_H 1 2025-05-07T20:26:27.1660000Z #define _GLIBCXX_TR1_SPECIAL_FUNCTION_UTIL_H 1 2025-05-07T20:26:27.1660087Z #define LONG_BIT 64 2025-05-07T20:26:27.1660201Z #define __SIZEOF_PTHREAD_BARRIERATTR_T 4 2025-05-07T20:26:27.1660308Z #define _GLIBCXX_USE_ALLOCATOR_NEW 1 2025-05-07T20:26:27.1660437Z #define __cpp_lib_math_special_functions 201603L 2025-05-07T20:26:27.1660546Z #define __fsfilcnt_t_defined 2025-05-07T20:26:27.1660640Z #define __blkcnt_t_defined 2025-05-07T20:26:27.1660911Z #define cudaKernelNodeAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:27.1661012Z #define __USE_LARGEFILE 1 2025-05-07T20:26:27.1661115Z #define __cpp_constexpr 201603L 2025-05-07T20:26:27.1661217Z #define CUDART_VERSION 12060 2025-05-07T20:26:27.1661318Z #define NL_TEXTMAX INT_MAX 2025-05-07T20:26:27.1661423Z #define cudaDeviceMapHost 0x08 2025-05-07T20:26:27.1661519Z #define _GLIBCXX_CMATH 1 2025-05-07T20:26:27.1661725Z #define __attribute_format_arg__(x) __attribute__ ((__format_arg__ (x))) 2025-05-07T20:26:27.1661819Z #define __lldiv_t_defined 1 2025-05-07T20:26:27.1661905Z #define __SSE2__ 1 2025-05-07T20:26:27.1661995Z #define _IOLBF 1 2025-05-07T20:26:27.1662099Z #define _GLIBCXX_HAVE_SYS_TYPES_H 1 2025-05-07T20:26:27.1662205Z #define _GLIBCXX_HAVE_FLOORF 1 2025-05-07T20:26:27.1662312Z #define __cpp_deduction_guides 201703L 2025-05-07T20:26:27.1662410Z #define _GLIBCXX_HAVE_EXPF 1 2025-05-07T20:26:27.1662528Z #define __annotate__(a) __attribute__((a)) 2025-05-07T20:26:27.1662622Z #define __INT32_TYPE__ int 2025-05-07T20:26:27.1662717Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:26:27.1662834Z #define cudaDeviceSyncMemops 0x80 2025-05-07T20:26:27.1663021Z #define __cpp_exceptions 199711L 2025-05-07T20:26:27.1663125Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:26:27.1663248Z #define cudaDeviceScheduleYield 0x02 2025-05-07T20:26:27.1663346Z #define _SYS_SYSMACROS_H 1 2025-05-07T20:26:27.1663569Z #define _GLIBCXX_TR1_LEGENDRE_FUNCTION_TCC 1 2025-05-07T20:26:27.1663743Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:26:27.1663846Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:26:27.1663954Z #define __SWORD_TYPE long int 2025-05-07T20:26:27.1664056Z #define __INTMAX_TYPE__ long int 2025-05-07T20:26:27.1664160Z #define _GLIBCXX11_USE_C99_MATH 1 2025-05-07T20:26:27.1664266Z #define __PTHREAD_SPINS 0, 0 2025-05-07T20:26:27.1664366Z #define _BITS_POSIX1_LIM_H 1 2025-05-07T20:26:27.1664654Z #define cudaStreamAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:27.1664763Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:26:27.1664915Z #define math_errhandling (MATH_ERRNO | MATH_ERREXCEPT) 2025-05-07T20:26:27.1665009Z #define _T_SIZE 2025-05-07T20:26:27.1665129Z #define cudaHostAllocDefault 0x00 2025-05-07T20:26:27.1665262Z #define _PSTL_PRAGMA_SIMD_EXCLUSIVE_SCAN(PRM) 
2025-05-07T20:26:27.1665404Z #define __va_arg_pack() __builtin_va_arg_pack () 2025-05-07T20:26:27.1665504Z #define _POSIX_TIMER_MAX 32 2025-05-07T20:26:27.1665603Z #define _GLIBCXX_HAVE_TLS 1 2025-05-07T20:26:27.1665737Z #define _GLIBCXX_NOTHROW _GLIBCXX_USE_NOEXCEPT 2025-05-07T20:26:27.1665838Z #define _GLIBCXX_HAVE_ACOSL 1 2025-05-07T20:26:27.1665945Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:26:27.1666049Z #define __ATOMIC_CONSUME 1 2025-05-07T20:26:27.1666229Z #define __CUDA_ARCH_HAS_FEATURE__(_FEAT) __CUDA_ARCH_FEAT_ ##_FEAT 2025-05-07T20:26:27.1666326Z #define __GNUC_MINOR__ 4 2025-05-07T20:26:27.1666443Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:26:27.1666545Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:26:27.1666671Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:27.1666766Z #define __PIE__ 2 2025-05-07T20:26:27.1666881Z #define LITTLE_ENDIAN __LITTLE_ENDIAN 2025-05-07T20:26:27.1666993Z #define _GLIBCXX_HAVE_INT64_T_LONG 1 2025-05-07T20:26:27.1667189Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:26:27.1667415Z #define __intN_t(N,MODE) typedef int int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:26:27.1667518Z #define __nlink_t_defined 2025-05-07T20:26:27.1667649Z #define _GLIBCXX17_DEPRECATED [[__deprecated__]] 2025-05-07T20:26:27.1667764Z #define _PSTL_STRING(x) _PSTL_STRING_AUX(x) 2025-05-07T20:26:27.1667864Z #define _XOPEN_LIM_H 1 2025-05-07T20:26:27.1668127Z #define __u_intN_t(N,MODE) typedef unsigned int u_int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:26:27.1668257Z #define __cpp_template_template_args 201611L 2025-05-07T20:26:27.1668367Z #define _GTHREAD_USE_MUTEX_TIMEDLOCK 1 2025-05-07T20:26:27.1668472Z #define BC_DIM_MAX _POSIX2_BC_DIM_MAX 2025-05-07T20:26:27.1668580Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:26:27.1668680Z #define __FILE_defined 1 2025-05-07T20:26:27.1668862Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:26:27.1668968Z #define _GLIBCXX_HAVE_SINCOS 1 2025-05-07T20:26:27.1669071Z #define __USE_XOPEN_EXTENDED 1 2025-05-07T20:26:27.1669185Z #define __cpp_lib_tuple_element_t 201402L 2025-05-07T20:26:27.1669314Z #define isascii_l(c,l) __isascii_l ((c), (l)) 2025-05-07T20:26:27.1669448Z #define cudaInvalidDeviceId ((int)-2) 2025-05-07T20:26:27.1669570Z #define _GLIBCXX_HAVE_SYS_RESOURCE_H 1 2025-05-07T20:26:27.1669675Z #define __INT16_C(c) c 2025-05-07T20:26:27.1669773Z #define __U32_TYPE unsigned int 2025-05-07T20:26:27.1669881Z #define _GLIBCXX_HAVE_SYS_IOCTL_H 1 2025-05-07T20:26:27.1670007Z #define FD_CLR(fd,fdsetp) __FD_CLR (fd, fdsetp) 2025-05-07T20:26:27.1670095Z #define __STDC__ 1 2025-05-07T20:26:27.1670200Z #define _GLIBCXX_HAVE_VWSCANF 1 2025-05-07T20:26:27.1670306Z #define _GLIBCXX_HAVE_EXECINFO_H 1 2025-05-07T20:26:27.1670405Z #define _GLIBCXX_USE_REALPATH 1 2025-05-07T20:26:27.1670653Z #define __attribute_malloc__ __attribute__ ((__malloc__)) 2025-05-07T20:26:27.1670747Z #define __FLT32X_DIG__ 15 2025-05-07T20:26:27.1670850Z #define _GLIBCXX_USE_C99_CTYPE_TR1 1 2025-05-07T20:26:27.1671029Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:26:27.1671147Z #define cudaArrayDeferredMapping 0x80 2025-05-07T20:26:27.1671266Z #define _GLIBCXX_END_NAMESPACE_CONTAINER 2025-05-07T20:26:27.1671368Z #define USHRT_MAX (SHRT_MAX * 2 + 1) 2025-05-07T20:26:27.1671472Z #define __cpp_lib_is_swappable 201603 2025-05-07T20:26:27.1671566Z #define stdin stdin 2025-05-07T20:26:27.1671662Z #define __ino64_t_defined 
2025-05-07T20:26:27.1671753Z #define STA_CLK 0x8000 2025-05-07T20:26:27.1671855Z #define __clockid_t_defined 1 2025-05-07T20:26:27.1672006Z #define _GLIBCXX_NOEXCEPT_IF(...) noexcept(__VA_ARGS__) 2025-05-07T20:26:27.1672173Z #define __attribute_noinline__ __attribute__ ((__noinline__)) 2025-05-07T20:26:27.1672285Z #define __cudaCDP2MemsetAsync 2025-05-07T20:26:27.1672391Z #define _PSTL_PRAGMA_SIMD_SCAN(PRM) 2025-05-07T20:26:27.1672505Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL 2025-05-07T20:26:27.1672619Z #define _GLIBCXX_TR1_POLY_HERMITE_TCC 1 2025-05-07T20:26:27.1672822Z #define __FD_SET(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] |= __FD_MASK (d))) 2025-05-07T20:26:27.1672924Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:26:27.1673451Z #define __tobody(c,f,a,args) (__extension__ ({ int __res; if (sizeof (c) > 1) { if (__builtin_constant_p (c)) { int __c = (c); __res = __c < -128 || __c > 255 ? __c : (a)[__c]; } else __res = f args; } else __res = (a)[(int) (c)]; __res; })) 2025-05-07T20:26:27.1673539Z #define DOMAIN 1 2025-05-07T20:26:27.1673642Z #define M_LN2 0.69314718055994530942 2025-05-07T20:26:27.1673730Z #define __NVCC__ 1 2025-05-07T20:26:27.1673838Z #define __cudaCDP2Memset2DAsync 2025-05-07T20:26:27.1673962Z #define __CLOCK_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:27.1674070Z #define _PSTL_PRAGMA_SIMD_EARLYEXIT 2025-05-07T20:26:27.1674182Z #define __throw_exception_again throw 2025-05-07T20:26:27.1674284Z #define M_SQRT2 1.41421356237309504880 2025-05-07T20:26:27.1674378Z #define __EXCEPTION_H 1 2025-05-07T20:26:27.1674485Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:26:27.1674600Z #define HUGE_VAL (__builtin_huge_val()) 2025-05-07T20:26:27.1674907Z #define cudaStreamAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:27.1675031Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:26:27.1675135Z #define _GLIBCXX_INLINE_VERSION 0 2025-05-07T20:26:27.1675233Z #define _GLIBCXX_USE_INT128 1 2025-05-07T20:26:27.1675345Z #define __cpp_lib_bool_constant 201505 2025-05-07T20:26:27.1675447Z #define PTHREAD_KEYS_MAX 1024 2025-05-07T20:26:27.1675603Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:26:27.1675716Z #define __FSFILCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:27.1675830Z #define _GLIBCXX_DOUBLE_IS_IEEE_BINARY64 1 2025-05-07T20:26:27.1675934Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:26:27.1676043Z #define __cpp_lib_tuples_by_type 201304 2025-05-07T20:26:27.1676146Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:26:27.1676259Z #define __cpp_generic_lambdas 201304L 2025-05-07T20:26:27.1676403Z #define _GLIBCXX_THROW_OR_ABORT(_EXC) (throw (_EXC)) 2025-05-07T20:26:27.1676505Z #define __useconds_t_defined 2025-05-07T20:26:27.1676615Z #define _GLIBCXX_USE_SCHED_YIELD 1 2025-05-07T20:26:27.1676800Z #define __attribute_deprecated__ __attribute__ ((__deprecated__)) 2025-05-07T20:26:27.1676952Z #define __cpp_lib_type_trait_variable_templates 201510L 2025-05-07T20:26:27.1677049Z #define __SSE_MATH__ 1 2025-05-07T20:26:27.1677145Z #define _IO_wint_t wint_t 2025-05-07T20:26:27.1677249Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:26:27.1677343Z #define _GLIBCXX_VERBOSE 1 2025-05-07T20:26:27.1677441Z #define _GLIBCXX_HAVE_ASINF 1 2025-05-07T20:26:27.1677565Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:26:27.1677666Z #define _GLIBCXX_HAVE_ISINFL 1 2025-05-07T20:26:27.1677765Z #define _GLIBCXX_HAVE_ASINL 1 2025-05-07T20:26:27.1677948Z #define __USE_ATFILE 1 2025-05-07T20:26:27.1678047Z #define _POSIX_OPEN_MAX 
20 2025-05-07T20:26:27.1678150Z #define _POSIX_LOGIN_NAME_MAX 9 2025-05-07T20:26:27.1678247Z #define _GCC_PTRDIFF_T 2025-05-07T20:26:27.1678550Z #define cudaKernelNodeAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:27.1678662Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:26:27.1678768Z #define _POSIX_THREAD_KEYS_MAX 128 2025-05-07T20:26:27.1678875Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:26:27.1678995Z #define __cpp_lib_array_constexpr 201803L 2025-05-07T20:26:27.1679085Z #define _STDLIB_H 1 2025-05-07T20:26:27.1679226Z #define __exctype(name) extern int name (int) __THROW 2025-05-07T20:26:27.1679340Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:26:27.1679458Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:26:27.1679610Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:27.1679734Z #define __SURFACE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:26:27.1679835Z #define __SM_61_INTRINSICS_H__ 2025-05-07T20:26:27.1680027Z #define _GLIBCXX_PACKAGE_STRING "package-unused version-unused" 2025-05-07T20:26:27.1680193Z #define __isxdigit_l(c,l) __isctype_l((c), _ISxdigit, (l)) 2025-05-07T20:26:27.1680305Z #define __glibcxx_requires_nonempty() 2025-05-07T20:26:27.1680433Z #define w_stopsig __wait_stopped.__w_stopsig 2025-05-07T20:26:27.1680531Z #define __ldiv_t_defined 1 2025-05-07T20:26:27.1680715Z #define __glibcxx_requires_irreflexive_pred(_First,_Last,_Pred) 2025-05-07T20:26:27.1680820Z #define ___int_ptrdiff_t_h 2025-05-07T20:26:27.1680992Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:27.1681099Z #define __cudaCDP2EventDestroy 2025-05-07T20:26:27.1681200Z #define __HOST_DEFINES_H__ 2025-05-07T20:26:27.1681306Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:26:27.1681411Z #define __SM_20_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:27.1681521Z #define _GLIBCXX_USE_NANOSLEEP 1 2025-05-07T20:26:27.1681611Z #define CUDART_CB 2025-05-07T20:26:27.1681726Z #define BC_BASE_MAX _POSIX2_BC_BASE_MAX 2025-05-07T20:26:27.1681855Z #define _GLIBCXX_USE_C99_INTTYPES_WCHAR_T_TR1 1 2025-05-07T20:26:27.1681946Z #define MB_LEN_MAX 16 2025-05-07T20:26:27.1682181Z #define __glibcxx_requires_partitioned_lower_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:26:27.1682286Z #define _GLIBCXX11_USE_C99_WCHAR 1 2025-05-07T20:26:27.1682416Z #define _IO_peekc(_fp) _IO_peekc_unlocked (_fp) 2025-05-07T20:26:27.1682537Z #define _GLIBCXX_HAVE_AS_SYMVER_DIRECTIVE 1 2025-05-07T20:26:27.1682637Z #define _GLIBCXX_HAVE_UNISTD_H 1 2025-05-07T20:26:27.1682787Z #define __glibc_likely(cond) __builtin_expect((cond), 1) 2025-05-07T20:26:27.1682903Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:26:27.1682994Z #define _GNU_SOURCE 1 2025-05-07T20:26:27.1683091Z #define __stub_putmsg 2025-05-07T20:26:27.1683182Z #define __CUDACC__ 1 2025-05-07T20:26:27.1683274Z #define __N(msgid) (msgid) 2025-05-07T20:26:27.1683368Z #define __P(args) args 2025-05-07T20:26:27.1683623Z #define cudaKernelNodeAttributeCooperative cudaLaunchAttributeCooperative 2025-05-07T20:26:27.1683734Z #define __cpp_init_captures 201304L 2025-05-07T20:26:27.1683851Z #define _GLIBCXX17_CONSTEXPR constexpr 2025-05-07T20:26:27.1683949Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:26:27.1684051Z #define __cpp_lib_as_const 201510 2025-05-07T20:26:27.1684145Z #define __WCHAR_T 2025-05-07T20:26:27.1684240Z #define __ATOMIC_RELEASE 3 2025-05-07T20:26:27.1684338Z #define __fsblkcnt_t_defined 2025-05-07T20:26:27.1684464Z #define __cudaCDP2EventCreateWithFlags 2025-05-07T20:26:27.1684570Z #define 
__DEVICE_DOUBLE_FUNCTIONS_H__ 2025-05-07T20:26:27.1684576Z 2025-05-07T20:26:27.1854798Z 2025-05-07T20:26:27.1855530Z + conda run -n build_binary nvcc --version 2025-05-07T20:26:27.1855545Z 2025-05-07T20:26:29.0932888Z nvcc: NVIDIA (R) Cuda compiler driver 2025-05-07T20:26:29.0933254Z Copyright (c) 2005-2024 NVIDIA Corporation 2025-05-07T20:26:29.0933572Z Built on Tue_Oct_29_23:50:19_PDT_2024 2025-05-07T20:26:29.0933892Z Cuda compilation tools, release 12.6, V12.6.85 2025-05-07T20:26:29.0934578Z Build cuda_12.6.r12.6/compiler.35059454_0 2025-05-07T20:26:29.0934782Z 2025-05-07T20:26:29.1561189Z 2025-05-07T20:26:29.1565668Z /usr/bin/nvidia-smi 2025-05-07T20:26:29.1571284Z + nvidia-smi 2025-05-07T20:26:29.1571437Z 2025-05-07T20:26:29.1747034Z Wed May 7 20:26:29 2025 2025-05-07T20:26:29.1747500Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:26:29.1748185Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:26:29.1748676Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:26:29.1749162Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:26:29.1749680Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:26:29.1750107Z | | | MIG M. | 2025-05-07T20:26:29.1750465Z |=========================================+========================+======================| 2025-05-07T20:26:29.1915409Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:26:29.1915856Z | 0% 25C P8 15W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:26:29.1916239Z | | | N/A | 2025-05-07T20:26:29.1916643Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:26:29.1920215Z 2025-05-07T20:26:29.1920616Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:26:29.1921040Z | Processes: | 2025-05-07T20:26:29.1921485Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:26:29.1921898Z | ID ID Usage | 2025-05-07T20:26:29.1922247Z |=========================================================================================| 2025-05-07T20:26:29.1925184Z | No running processes found | 2025-05-07T20:26:29.1925656Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:26:29.4620571Z 2025-05-07T20:26:29.4625215Z [INSTALL] Successfully installed CUDA 12.6.3 2025-05-07T20:26:29.4673807Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3 2025-05-07T20:26:29.4674362Z . 
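The pair of checks above validates the toolchain end to end: nvcc reports the CUDA toolkit installed inside the conda environment (release 12.6, V12.6.85), while nvidia-smi reports the host driver (570.133.07) and the highest CUDA version that driver supports (12.8). The configuration is healthy whenever the driver's supported CUDA version is at least the toolkit's, as it is here. A minimal sketch of the same verification, assuming the build_binary environment from this job:

    # Toolkit version inside the env vs. driver capability on the host
    conda run -n build_binary nvcc --version
    nvidia-smi --query-gpu=name,driver_version --format=csv,noheader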
2025-05-07T20:26:29.4673807Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:26:29.4674362Z . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:26:29.4685957Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:26:29.4686315Z env:
2025-05-07T20:26:29.4686542Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:26:29.4686836Z   BUILD_ENV: build_binary
2025-05-07T20:26:29.4687081Z   BUILD_TARGET: genai
2025-05-07T20:26:29.4687311Z   BUILD_VARIANT: cuda
2025-05-07T20:26:29.4687618Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:26:29.4687879Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:26:29.4688177Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:26:29.4688507Z ##[endgroup]
2025-05-07T20:26:29.8037568Z ################################################################################
2025-05-07T20:26:29.8038049Z # Install PyTorch (PIP)
2025-05-07T20:26:29.8038345Z #
2025-05-07T20:26:29.8054054Z # [2025-05-07T20:26:29.805Z] + install_pytorch_pip build_binary nightly cuda/12.6.3
2025-05-07T20:26:29.8054654Z ################################################################################
2025-05-07T20:26:29.8082310Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy
2025-05-07T20:26:30.7878268Z Channels:
2025-05-07T20:26:30.7878519Z  - conda-forge
2025-05-07T20:26:30.7878767Z Platform: linux-64
2025-05-07T20:26:34.1872332Z Collecting package metadata (repodata.json): done
2025-05-07T20:26:34.9102969Z Solving environment: done
2025-05-07T20:26:35.1475765Z ## Package Plan ##
2025-05-07T20:26:35.1476141Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:26:35.1476566Z   added / updated specs:
2025-05-07T20:26:35.1476829Z     - numpy
2025-05-07T20:26:35.1477124Z The following packages will be downloaded:
2025-05-07T20:26:35.1477467Z     package                    |            build
2025-05-07T20:26:35.1477783Z     ---------------------------|-----------------
2025-05-07T20:26:35.1478165Z     libblas-3.9.0              |31_h59b9bed_openblas          16 KB  conda-forge
2025-05-07T20:26:35.1478624Z     libcblas-3.9.0             |31_he106b2a_openblas          16 KB  conda-forge
2025-05-07T20:26:35.1479065Z     libgfortran-15.1.0         |      h69a702a_2              34 KB  conda-forge
2025-05-07T20:26:35.1479509Z     libgfortran5-15.1.0        |      hcea5267_2             1.5 MB  conda-forge
2025-05-07T20:26:35.1479960Z     liblapack-3.9.0            |31_h7ac8fdf_openblas          16 KB  conda-forge
2025-05-07T20:26:35.1480422Z     libopenblas-0.3.29         | pthreads_h94d23a6_0         5.6 MB  conda-forge
2025-05-07T20:26:35.1480861Z     numpy-2.2.5                |  py311h5d046bc_0            8.6 MB  conda-forge
2025-05-07T20:26:35.1481246Z     ------------------------------------------------------------
2025-05-07T20:26:35.1481588Z                                            Total:        15.9 MB
2025-05-07T20:26:35.1481960Z The following NEW packages will be INSTALLED:
2025-05-07T20:26:35.1482473Z   libblas            conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas
2025-05-07T20:26:35.1482964Z   libcblas           conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas
2025-05-07T20:26:35.1483458Z   libgfortran        conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2
2025-05-07T20:26:35.1484023Z   libgfortran5       conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2
2025-05-07T20:26:35.1484538Z   liblapack          conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas
2025-05-07T20:26:35.1485070Z   libopenblas        conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0
2025-05-07T20:26:35.1485890Z   numpy              conda-forge/linux-64::numpy-2.2.5-py311h5d046bc_0
2025-05-07T20:26:35.1486311Z Downloading and Extracting Packages: ...working...
2025-05-07T20:26:35.1486672Z [conda download progress bars elided: libblas, libcblas, liblapack, libgfortran, libgfortran5, libopenblas, and numpy each reach 100%]
2025-05-07T20:26:36.3032340Z done
2025-05-07T20:26:36.4044064Z Preparing transaction: done
2025-05-07T20:26:36.6054065Z Verifying transaction: done
2025-05-07T20:26:36.7062547Z Executing transaction: done
2025-05-07T20:26:36.8838121Z ################################################################################
2025-05-07T20:26:36.8838512Z # Install Package From PyTorch PIP: torch
2025-05-07T20:26:36.8838835Z #
2025-05-07T20:26:36.8855274Z # [2025-05-07T20:26:36.885Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.6.3
2025-05-07T20:26:36.8855759Z ################################################################################
2025-05-07T20:26:36.8873122Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:26:36.9763657Z [CHECK] Network does not appear to be blocked.
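With the network confirmed reachable, the script resolves the requested (package, channel, variant) triple into a full wheel-index URL and installs through it, retrying each command up to three times (the [EXEC] [ATTEMPT n/3] prefix); the resolution is performed by __prepare_pip_arguments in setup_env.bash, as the next step shows. A minimal sketch of the equivalent manual install, assuming the build_binary environment (the cu126 variant string is derived from BUILD_CUDA_VERSION=12.6.3):

    # Nightly cu126 wheel index, matching the channel prepared below
    VARIANT=cu126
    INDEX_URL="https://download.pytorch.org/whl/nightly/${VARIANT}/"
    # --pre is needed because nightly wheels carry pre-release version tags
    conda run -n build_binary pip install --pre torch --index-url "${INDEX_URL}"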
2025-05-07T20:26:36.9764033Z ################################################################################
2025-05-07T20:26:36.9764363Z # Prepare PIP Arguments (PyTorch PIP)
2025-05-07T20:26:36.9764652Z #
2025-05-07T20:26:36.9782355Z # [2025-05-07T20:26:36.977Z] + __prepare_pip_arguments torch nightly cuda/12.6.3
2025-05-07T20:26:36.9782789Z ################################################################################
2025-05-07T20:26:36.9806732Z [INSTALL] Extracted package (channel, version): (nightly, LATEST)
2025-05-07T20:26:36.9828146Z [INSTALL] Extracted package variant: cu126
2025-05-07T20:26:36.9844418Z [INSTALL] Using a non-RELEASE channel: nightly ...
2025-05-07T20:26:36.9844970Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu126/
2025-05-07T20:26:36.9852015Z [INSTALL] Extracted the full PIP package: --pre torch
2025-05-07T20:26:36.9859413Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu126/ ...
2025-05-07T20:26:36.9880582Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126/
2025-05-07T20:28:00.8791455Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu126/
2025-05-07T20:28:00.8791885Z Collecting torch
2025-05-07T20:28:00.8792545Z   Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (30 kB)
2025-05-07T20:28:00.8793252Z Collecting filelock (from torch)
2025-05-07T20:28:00.8794047Z   Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB)
2025-05-07T20:28:00.8794979Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from torch) (4.13.2)
2025-05-07T20:28:00.8795681Z Collecting sympy>=1.13.3 (from torch)
2025-05-07T20:28:00.8796184Z   Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB)
2025-05-07T20:28:00.8797352Z Collecting networkx (from torch)
2025-05-07T20:28:00.8797863Z   Downloading https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB)
2025-05-07T20:28:00.8798855Z Collecting jinja2 (from torch)
2025-05-07T20:28:00.8799328Z   Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB)
2025-05-07T20:28:00.8799842Z Collecting fsspec (from torch)
2025-05-07T20:28:00.8800331Z   Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB)
2025-05-07T20:28:00.8800895Z Collecting nvidia-cuda-nvrtc-cu12==12.6.77 (from torch)
2025-05-07T20:28:00.8801619Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (23.7 MB)
2025-05-07T20:28:00.8802857Z Collecting nvidia-cuda-runtime-cu12==12.6.77 (from torch)
2025-05-07T20:28:00.8803574Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (897 kB)
2025-05-07T20:28:00.8804765Z Collecting nvidia-cuda-cupti-cu12==12.6.80 (from torch)
2025-05-07T20:28:00.8805451Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.whl (8.9 MB)
2025-05-07T20:28:00.8807227Z Collecting nvidia-cudnn-cu12==9.5.1.17 (from torch)
2025-05-07T20:28:00.8808091Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl (571.0 MB)
2025-05-07T20:28:00.8809494Z Collecting nvidia-cublas-cu12==12.6.4.1 (from torch)
2025-05-07T20:28:00.8810274Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (393.1 MB)
2025-05-07T20:28:00.8811535Z Collecting nvidia-cufft-cu12==11.3.0.4 (from torch)
2025-05-07T20:28:00.8812214Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.whl (200.2 MB)
2025-05-07T20:28:00.8813369Z Collecting nvidia-curand-cu12==10.3.7.77 (from torch)
2025-05-07T20:28:00.8814055Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.whl (56.3 MB)
2025-05-07T20:28:00.8815227Z Collecting nvidia-cusolver-cu12==11.7.1.2 (from torch)
2025-05-07T20:28:00.8815931Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.whl (158.2 MB)
2025-05-07T20:28:00.8817245Z Collecting nvidia-cusparse-cu12==12.5.4.2 (from torch)
2025-05-07T20:28:00.8817940Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.whl (216.6 MB)
2025-05-07T20:28:00.8819120Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch)
2025-05-07T20:28:00.8819825Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB)
2025-05-07T20:28:00.8820987Z Collecting nvidia-nccl-cu12==2.26.2 (from torch)
2025-05-07T20:28:00.8821755Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB)
2025-05-07T20:28:00.8822537Z Collecting nvidia-nvtx-cu12==12.6.77 (from torch)
2025-05-07T20:28:00.8823184Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (89 kB)
2025-05-07T20:28:00.8823855Z Collecting nvidia-nvjitlink-cu12==12.6.85 (from torch)
2025-05-07T20:28:00.8824633Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (19.7 MB)
2025-05-07T20:28:00.8825883Z Collecting nvidia-cufile-cu12==1.11.1.6 (from torch)
2025-05-07T20:28:00.8826667Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB)
2025-05-07T20:28:00.8827475Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch)
2025-05-07T20:28:00.8828309Z   Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB)
2025-05-07T20:28:00.8829575Z Requirement already satisfied: setuptools>=40.8.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from pytorch-triton==3.3.0+git96316ce5->torch) (78.1.1)
2025-05-07T20:28:00.8830443Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch)
2025-05-07T20:28:00.8831003Z   Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB)
2025-05-07T20:28:00.8832100Z Collecting MarkupSafe>=2.0 (from jinja2->torch)
2025-05-07T20:28:00.8832797Z   Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (28 kB)
2025-05-07T20:28:00.8833849Z   Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp311-cp311-manylinux_2_28_x86_64.whl (825.6 MB)
2025-05-07T20:28:00.8835505Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.1 MB)
2025-05-07T20:28:00.8837172Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB)
2025-05-07T20:28:00.8838815Z   Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.5 MB)
2025-05-07T20:28:00.8841433Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch
nvidia-curand-cu12-10.3.7.77 nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu126 2025-05-07T20:28:00.8847080Z 2025-05-07T20:28:03.0794179Z torch 2.8.0.dev20250507+cu126 2025-05-07T20:28:03.0796471Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu126) 2025-05-07T20:28:06.4688891Z [CHECK] Python (sub-)package 'torch.distributed' found ... 2025-05-07T20:28:09.8884711Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu126 2025-05-07T20:28:09.8885153Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ... 2025-05-07T20:28:13.2228983Z True 2025-05-07T20:28:13.2229287Z True 2025-05-07T20:28:13.2229820Z 2025-05-07T20:28:13.2855279Z [INSTALL] Successfully installed PyTorch through PyTorch PIP 2025-05-07T20:28:13.2893832Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:13.2894439Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:13.2909319Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:13.2909708Z env: 2025-05-07T20:28:13.2909945Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:13.2910275Z BUILD_ENV: build_binary 2025-05-07T20:28:13.2910521Z BUILD_TARGET: genai 2025-05-07T20:28:13.2910746Z BUILD_VARIANT: cuda 2025-05-07T20:28:13.2910983Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:13.2911230Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:13.2911533Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:13.2911865Z ##[endgroup] 2025-05-07T20:28:13.6254919Z /home/ec2-user/miniconda/bin/conda 2025-05-07T20:28:13.6257772Z ################################################################################ 2025-05-07T20:28:13.6258904Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T20:28:13.6259635Z # 2025-05-07T20:28:13.6272955Z # [2025-05-07T20:28:13.626Z] + collect_pytorch_env_info build_binary 2025-05-07T20:28:13.6273698Z ################################################################################ 2025-05-07T20:28:13.6273917Z 2025-05-07T20:28:13.6288410Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:13.7233552Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:13.7243261Z [INFO] Downloading the PyTorch environment info collection script ... 2025-05-07T20:28:13.7243889Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T20:28:13.7244289Z 2025-05-07T20:28:13.8227856Z 2025-05-07T20:28:13.8228517Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T20:28:13.8251719Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py 2025-05-07T20:28:19.9267834Z Collecting environment information... 
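Those variant and ABI checks can be reproduced by hand in the same env; the environment report that follows confirms the result. A minimal sketch, assuming only that the nightly torch wheel installed above is importable:

    import torch

    # Version string encodes channel and variant, e.g. "2.8.0.dev20250507+cu126"
    print(torch.__version__)
    # CUDA toolkit the wheel was built against, e.g. "12.6"
    print(torch.version.cuda)
    # Two common probes for the C++11 ABI setting; the job prints "True True"
    print(torch._C._GLIBCXX_USE_CXX11_ABI)
    print(torch.compiled_with_cxx11_abi())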
2025-05-07T20:28:19.9268386Z PyTorch version: 2.8.0.dev20250507+cu126 2025-05-07T20:28:19.9268703Z Is debug build: False 2025-05-07T20:28:19.9268955Z CUDA used to build PyTorch: 12.6 2025-05-07T20:28:19.9269250Z ROCM used to build PyTorch: N/A 2025-05-07T20:28:19.9269432Z 2025-05-07T20:28:19.9269535Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:28:19.9269854Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:28:19.9270166Z Clang version: Could not collect 2025-05-07T20:28:19.9270440Z CMake version: Could not collect 2025-05-07T20:28:19.9270708Z Libc version: glibc-2.34 2025-05-07T20:28:19.9270858Z 2025-05-07T20:28:19.9271158Z Python version: 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:53:32) [GCC 12.3.0] (64-bit runtime) 2025-05-07T20:28:19.9271766Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:28:19.9272169Z Is CUDA available: True 2025-05-07T20:28:19.9272426Z CUDA runtime version: 12.6.85 2025-05-07T20:28:19.9272806Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:28:19.9273233Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:28:19.9273590Z Nvidia driver version: 570.133.07 2025-05-07T20:28:19.9273958Z cuDNN version: Could not collect 2025-05-07T20:28:19.9274341Z HIP runtime version: N/A 2025-05-07T20:28:19.9274684Z MIOpen runtime version: N/A 2025-05-07T20:28:19.9275025Z Is XNNPACK available: True 2025-05-07T20:28:19.9275236Z 2025-05-07T20:28:19.9275318Z CPU: 2025-05-07T20:28:19.9275540Z Architecture: x86_64 2025-05-07T20:28:19.9275868Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:28:19.9276258Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:28:19.9276643Z Byte Order: Little Endian 2025-05-07T20:28:19.9276954Z CPU(s): 16 2025-05-07T20:28:19.9277253Z On-line CPU(s) list: 0-15 2025-05-07T20:28:19.9277987Z Vendor ID: AuthenticAMD 2025-05-07T20:28:19.9278341Z Model name: AMD EPYC 7R32 2025-05-07T20:28:19.9278654Z CPU family: 23 2025-05-07T20:28:19.9278948Z Model: 49 2025-05-07T20:28:19.9279239Z Thread(s) per core: 2 2025-05-07T20:28:19.9279524Z Core(s) per socket: 8 2025-05-07T20:28:19.9279816Z Socket(s): 1 2025-05-07T20:28:19.9280098Z Stepping: 0 2025-05-07T20:28:19.9280393Z BogoMIPS: 5600.00 2025-05-07T20:28:19.9282472Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:28:19.9284716Z Hypervisor vendor: KVM 2025-05-07T20:28:19.9285029Z Virtualization type: full 2025-05-07T20:28:19.9285370Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:28:19.9285734Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:28:19.9286091Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:28:19.9286449Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:28:19.9286767Z NUMA node(s): 1 2025-05-07T20:28:19.9287068Z NUMA node0 CPU(s): 0-15 2025-05-07T20:28:19.9287407Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:28:19.9287903Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:28:19.9288262Z Vulnerability L1tf: Not affected 2025-05-07T20:28:19.9288623Z Vulnerability 
Mds: Not affected 2025-05-07T20:28:19.9288985Z Vulnerability Meltdown: Not affected 2025-05-07T20:28:19.9289346Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:28:19.9289717Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:28:19.9290271Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:28:19.9290854Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:28:19.9291402Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:28:19.9292094Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:28:19.9292957Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:28:19.9293633Z Vulnerability Srbds: Not affected 2025-05-07T20:28:19.9293995Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:28:19.9294237Z 2025-05-07T20:28:19.9294342Z Versions of relevant libraries: 2025-05-07T20:28:19.9294609Z [pip3] numpy==2.2.5 2025-05-07T20:28:19.9294845Z [pip3] nvidia-cublas-cu12==12.6.4.1 2025-05-07T20:28:19.9295147Z [pip3] nvidia-cuda-cupti-cu12==12.6.80 2025-05-07T20:28:19.9295453Z [pip3] nvidia-cuda-nvrtc-cu12==12.6.77 2025-05-07T20:28:19.9295763Z [pip3] nvidia-cuda-runtime-cu12==12.6.77 2025-05-07T20:28:19.9296082Z [pip3] nvidia-cudnn-cu12==9.5.1.17 2025-05-07T20:28:19.9296375Z [pip3] nvidia-cufft-cu12==11.3.0.4 2025-05-07T20:28:19.9296659Z [pip3] nvidia-curand-cu12==10.3.7.77 2025-05-07T20:28:19.9296965Z [pip3] nvidia-cusolver-cu12==11.7.1.2 2025-05-07T20:28:19.9297273Z [pip3] nvidia-cusparse-cu12==12.5.4.2 2025-05-07T20:28:19.9297695Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:28:19.9298002Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:28:19.9298292Z [pip3] nvidia-nvjitlink-cu12==12.6.85 2025-05-07T20:28:19.9298595Z [pip3] nvidia-nvtx-cu12==12.6.77 2025-05-07T20:28:19.9298894Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:28:19.9299205Z [pip3] torch==2.8.0.dev20250507+cu126 2025-05-07T20:28:19.9299577Z [conda] cuda-cudart 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:19.9300058Z [conda] cuda-cudart-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:19.9300574Z [conda] cuda-cudart-dev_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:19.9301095Z [conda] cuda-cudart-static 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:19.9301626Z [conda] cuda-cudart-static_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:19.9302160Z [conda] cuda-cudart_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:19.9302643Z [conda] cuda-cupti 12.6.80 hbd13f7d_0 conda-forge 2025-05-07T20:28:19.9303113Z [conda] cuda-cupti-dev 12.6.80 h5888daf_0 conda-forge 2025-05-07T20:28:19.9303674Z [conda] cuda-libraries 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:19.9304170Z [conda] cuda-libraries-dev 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:19.9304655Z [conda] cuda-nvrtc 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:19.9305119Z [conda] cuda-nvrtc-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:19.9305573Z [conda] cuda-nvtx 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:19.9306396Z [conda] cuda-opencl 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:19.9306954Z [conda] cuda-opencl-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:19.9307512Z [conda] cuda-runtime 12.6.3 ha804496_0 conda-forge 2025-05-07T20:28:19.9308049Z [conda] libcublas 12.6.4.1 h5888daf_1 conda-forge 
2025-05-07T20:28:19.9308595Z [conda] libcublas-dev 12.6.4.1 h5888daf_1 conda-forge 2025-05-07T20:28:19.9309140Z [conda] libcufft 11.3.0.4 hbd13f7d_0 conda-forge 2025-05-07T20:28:19.9309666Z [conda] libcufft-dev 11.3.0.4 h5888daf_0 conda-forge 2025-05-07T20:28:19.9310206Z [conda] libcurand 10.3.7.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:19.9310750Z [conda] libcurand-dev 10.3.7.77 h5888daf_0 conda-forge 2025-05-07T20:28:19.9311296Z [conda] libcusolver 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:19.9311860Z [conda] libcusolver-dev 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:19.9312426Z [conda] libcusparse 12.5.4.2 hbd13f7d_0 conda-forge 2025-05-07T20:28:19.9312991Z [conda] libcusparse-dev 12.5.4.2 h5888daf_0 conda-forge 2025-05-07T20:28:19.9313552Z [conda] libnvjitlink 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:19.9314127Z [conda] libnvjitlink-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:19.9314664Z [conda] numpy 2.2.5 py311h5d046bc_0 conda-forge 2025-05-07T20:28:19.9315196Z [conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi 2025-05-07T20:28:19.9315770Z [conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi 2025-05-07T20:28:19.9316354Z [conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:19.9316940Z [conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:19.9317505Z [conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi 2025-05-07T20:28:19.9318196Z [conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi 2025-05-07T20:28:19.9318680Z [conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi 2025-05-07T20:28:19.9319166Z [conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi 2025-05-07T20:28:19.9319656Z [conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi 2025-05-07T20:28:19.9320155Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:28:19.9320640Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:28:19.9321116Z [conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi 2025-05-07T20:28:19.9321600Z [conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:19.9322080Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:28:19.9322602Z [conda] torch 2.8.0.dev20250507+cu126 pypi_0 pypi 2025-05-07T20:28:19.9322879Z 2025-05-07T20:28:20.0005028Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:20.0005907Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:20.0019386Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:20.0019732Z env: 2025-05-07T20:28:20.0019964Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:20.0020259Z BUILD_ENV: build_binary 2025-05-07T20:28:20.0020510Z BUILD_TARGET: genai 2025-05-07T20:28:20.0020747Z BUILD_VARIANT: cuda 2025-05-07T20:28:20.0020987Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:20.0021241Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:20.0021547Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:20.0021880Z ##[endgroup] 2025-05-07T20:28:20.3370501Z ################################################################################ 2025-05-07T20:28:20.3370885Z # Prepare FBGEMM-GPU Build 2025-05-07T20:28:20.3371139Z # 2025-05-07T20:28:20.3385767Z # [2025-05-07T20:28:20.338Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:28:20.3386174Z ################################################################################ 2025-05-07T20:28:20.3386389Z 2025-05-07T20:28:20.3400834Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:20.4300174Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:20.4323747Z [BUILD] Running git submodules update ... 2025-05-07T20:28:20.4346325Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:28:20.4707736Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:28:20.4708209Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:28:20.4708649Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:28:20.4709044Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:28:20.4709436Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:28:20.4709883Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:28:20.4710279Z Synchronizing submodule url for '../external/json' 2025-05-07T20:28:20.4742683Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:28:20.5285802Z [BUILD] Installing other build dependencies ... 
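The dependency-install step that follows is a plain pip install run inside the conda env. A minimal sketch of the equivalent call, assuming the build_binary env from the earlier steps and the requirements.txt in fbgemm_gpu/:

    import subprocess

    # Mirrors the [EXEC] command below; `conda run` executes pip inside the env
    subprocess.run(
        ["conda", "run", "--no-capture-output", "-n", "build_binary",
         "python", "-m", "pip", "install", "-r", "requirements.txt"],
        check=True,  # raise on failure, as the job's retry wrapper would detect
    )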
2025-05-07T20:28:20.5308122Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:28:22.9021200Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:28:22.9207932Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:28:23.0155817Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:28:23.0185534Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:28:23.2334460Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:28:23.2366225Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:28:23.3404648Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:28:23.3437623Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:28:23.6516411Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:28:23.6564176Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:28:23.7080579Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:28:23.7083443Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:28:23.7731833Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:28:23.7761196Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:28:23.8188496Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:28:23.8699174Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:28:23.8729838Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:28:23.9953405Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:28:23.9987903Z Downloading PyYAML-6.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:28:24.0948522Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:28:24.0996041Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:28:24.1441914Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:28:24.2066899Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:28:24.2098069Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:28:24.3014921Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:28:24.3042361Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:28:24.4095731Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:28:24.4143902Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:28:24.5171993Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:24.5266078Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:28:24.6169513Z Collecting pyproject_hooks (from build->-r requirements.txt 
(line 14)) 2025-05-07T20:28:24.6205181Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:28:24.7209588Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:24.7239416Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:24.8309627Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:24.8339061Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:24.8832451Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:28:24.9374538Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:24.9418132Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:28:24.9909318Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:28:25.0425686Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:28:25.0453468Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:28:25.0955598Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:28:25.1617553Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:25.1662165Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:28:25.2138161Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:28:25.2629791Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:28:25.3157006Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:28:25.8172488Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 55.6 MB/s eta 0:00:00 2025-05-07T20:28:25.8208888Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:28:25.8771913Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:28:25.9486893Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:28:26.0098782Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:28:26.0698475Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:28:26.1248979Z Downloading PyYAML-6.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (762 kB) 2025-05-07T20:28:26.1860611Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 763.0/763.0 kB 8.5 MB/s eta 0:00:00 2025-05-07T20:28:26.1897953Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:28:26.2371664Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:26.2905812Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:28:26.3444600Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:28:26.4127553Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:28:26.4646964Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:28:26.5295039Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:28:26.5837561Z Downloading 
pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:26.6444549Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:28:26.7051328Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:28:26.8825122Z Installing collected packages: sortedcontainers, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:28:29.2653915Z 2025-05-07T20:28:29.2680533Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 typing-inspect-0.9.0 2025-05-07T20:28:29.4480534Z ################################################################################ 2025-05-07T20:28:29.4480894Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:28:29.4481160Z # 2025-05-07T20:28:29.4499509Z # [2025-05-07T20:28:29.449Z] + install_triton_pip build_binary 2025-05-07T20:28:29.4499890Z ################################################################################ 2025-05-07T20:28:29.4500103Z 2025-05-07T20:28:29.4500332Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:28:29.4500767Z ################################################################################ 2025-05-07T20:28:29.4501117Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:28:29.4501433Z # 2025-05-07T20:28:29.4516936Z # [2025-05-07T20:28:29.451Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:29.4517448Z ################################################################################ 2025-05-07T20:28:29.4517667Z 2025-05-07T20:28:29.4532687Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:29.5407559Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:29.5407947Z ################################################################################ 2025-05-07T20:28:29.5408371Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:28:29.5408713Z # 2025-05-07T20:28:29.5425280Z # [2025-05-07T20:28:29.542Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:29.5425901Z ################################################################################ 2025-05-07T20:28:29.5426121Z 2025-05-07T20:28:29.5474824Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:28:29.5491462Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:28:29.5491973Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:29.5500295Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:29.5510033Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:28:29.5530902Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:37.6100365Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts. 2025-05-07T20:28:37.6101893Z torch 2.8.0.dev20250507+cu126 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux" and platform_machine == "x86_64", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:28:37.6102621Z 2025-05-07T20:28:37.6102827Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:37.6103239Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:37.6104043Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:28:37.6105241Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:28:37.6106657Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 49.3 MB/s eta 0:00:00 2025-05-07T20:28:37.6107129Z Installing collected packages: pytorch-triton 2025-05-07T20:28:37.6107585Z Attempting uninstall: pytorch-triton 2025-05-07T20:28:37.6108203Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:28:37.6108707Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:28:37.6117997Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:28:37.6118493Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:28:37.6118762Z 2025-05-07T20:28:39.8340886Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:28:39.8344286Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:28:41.9702307Z ################################################################################ 2025-05-07T20:28:41.9703130Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:28:41.9703820Z ################################################################################ 2025-05-07T20:28:41.9704211Z 2025-05-07T20:28:44.0023884Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:28:46.1049307Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:28:46.1052955Z [BUILD] Successfully ran git submodules update 2025-05-07T20:28:46.1086597Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:46.1087088Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:46.1102995Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:46.1103341Z env: 2025-05-07T20:28:46.1103578Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:46.1103868Z BUILD_ENV: build_binary 2025-05-07T20:28:46.1104114Z BUILD_TARGET: genai 2025-05-07T20:28:46.1104338Z BUILD_VARIANT: cuda 2025-05-07T20:28:46.1104566Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:46.1104820Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:46.1105123Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:46.1105453Z ##[endgroup] 2025-05-07T20:28:46.4460320Z ################################################################################ 2025-05-07T20:28:46.4460691Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:28:46.4460958Z # 2025-05-07T20:28:46.4475483Z # [2025-05-07T20:28:46.447Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:46.4476119Z ################################################################################ 2025-05-07T20:28:46.4476338Z 2025-05-07T20:28:46.4476695Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:46.4477694Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:46.4478027Z 2025-05-07T20:28:46.4593859Z 7d736fee50ce6716a3e7a5042537bb0127686eb5 fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:46.4596821Z 2025-05-07T20:28:46.4598086Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:46.4598437Z 2025-05-07T20:28:46.4732845Z 0a23a86eb2b2d57e022570aa036f879ef469d611e4dea9b68c6aaab9d3746d15 fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:46.4735885Z 2025-05-07T20:28:46.4736599Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:46.4736954Z 2025-05-07T20:28:46.4967100Z d1d3b7cfd6b55cbd576df724965b7f7c fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:46.4969233Z 2025-05-07T20:28:46.4978997Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl ... 2025-05-07T20:28:46.5000195Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:49.1473264Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:49.1475144Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:28:49.1476807Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:28:49.1477655Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:28:49.1478190Z 2025-05-07T20:28:55.9671041Z ################################################################################ 2025-05-07T20:28:55.9671420Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:28:55.9671817Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu126 2025-05-07T20:28:55.9672248Z [CHECK] CUDA version reported by PyTorch is: 12.6 2025-05-07T20:28:55.9672568Z [CHECK] 2025-05-07T20:28:55.9672891Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU 2025-05-07T20:28:55.9673384Z [CHECK] package channel; the package may be broken at runtime!!! 2025-05-07T20:28:55.9673775Z ################################################################################ 2025-05-07T20:28:55.9673985Z 2025-05-07T20:28:55.9674107Z [INSTALL] Checking imports and symbols ... 2025-05-07T20:28:59.8784743Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:29:03.7660029Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:07.6438901Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:07.6445314Z [CHECK] Printing out the FBGEMM-GPU version ... 2025-05-07T20:29:19.3388789Z ################################################################################ 2025-05-07T20:29:19.3391244Z [CHECK] The installed FBGEMM TARGET is: genai 2025-05-07T20:29:19.3391672Z [CHECK] The installed FBGEMM VARIANT is: cuda 2025-05-07T20:29:19.3392085Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7 2025-05-07T20:29:19.3392427Z ################################################################################ 2025-05-07T20:29:19.3392657Z 2025-05-07T20:29:27.1544678Z ################################################################################ 2025-05-07T20:29:27.1545201Z [CHECK] FBGEMM_GPU Experimental Packages 2025-05-07T20:29:27.1547123Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils'] 2025-05-07T20:29:27.1550028Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 2025-05-07T20:29:27.1550545Z ################################################################################ 2025-05-07T20:29:27.1550773Z 2025-05-07T20:29:27.1550927Z [INSTALL] Check for installation of Python sources ... 2025-05-07T20:29:31.0603531Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ... 2025-05-07T20:29:34.9460854Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ... 2025-05-07T20:29:38.9749945Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ... 2025-05-07T20:29:42.8631773Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ... 2025-05-07T20:29:42.8637164Z [INSTALL] Check for operator registrations ... 
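The operator-registration check announced above amounts to importing fbgemm_gpu (which loads the native libraries as a side effect) and resolving each operator on torch.ops. A minimal sketch, using the three operator names printed below:

    import torch
    import fbgemm_gpu  # noqa: F401 - the import loads the FBGEMM operator libraries

    for name in ("nccl_init", "gqa_attn_splitk", "rope_qkv_decoding"):
        # Attribute lookup raises if the operator was never registered
        op = getattr(torch.ops.fbgemm, name)
        print(f"torch.ops.fbgemm.{name} -> {op}")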
2025-05-07T20:29:46.6968109Z fbgemm.nccl_init 2025-05-07T20:29:46.6968360Z 2025-05-07T20:29:46.7598601Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:29:50.5842528Z fbgemm.gqa_attn_splitk 2025-05-07T20:29:50.5842790Z 2025-05-07T20:29:50.6484579Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:29:54.4724513Z fbgemm.rope_qkv_decoding 2025-05-07T20:29:54.4724722Z 2025-05-07T20:29:54.5347646Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:29:54.5348466Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:29:54.5383291Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:29:54.5383766Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:29:54.5396556Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:54.5396901Z env: 2025-05-07T20:29:54.5397131Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:54.5397428Z BUILD_ENV: build_binary 2025-05-07T20:29:54.5397683Z BUILD_TARGET: genai 2025-05-07T20:29:54.5397930Z BUILD_VARIANT: cuda 2025-05-07T20:29:54.5398167Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:29:54.5398419Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:54.5398724Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:54.5399063Z ##[endgroup] 2025-05-07T20:29:54.8749094Z ################################################################################ 2025-05-07T20:29:54.8749573Z # Test All FBGEMM-GPU Modules 2025-05-07T20:29:54.8749830Z # 2025-05-07T20:29:54.8764275Z # [2025-05-07T20:29:54.876Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:29:54.8764820Z ################################################################################ 2025-05-07T20:29:54.8765098Z 2025-05-07T20:30:02.6885464Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:30:02.6886034Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:30:02.6886429Z [TEST] Determined the test directories: 2025-05-07T20:30:02.6886769Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:30:02.6887069Z fbgemm_gpu/experimental/example/test 2025-05-07T20:30:02.6887366Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:30:02.6887667Z 2025-05-07T20:30:02.6897996Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:30:02.6905100Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:30:02.6905761Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:30:02.6906174Z 2025-05-07T20:30:03.1138587Z 2025-05-07T20:30:03.1138999Z [TEST] Installing PyTest ... 
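Because ENFORCE_CUDA_DEVICE=1 is set for this job, the GPU must be usable before any suite runs. A quick sanity check one could run in the same env (a sketch, not the job's own script):

    import torch

    # Fail fast if the driver or the A10G device is not visible to torch
    assert torch.cuda.is_available(), "no CUDA device visible"
    print(torch.cuda.device_count(), torch.cuda.get_device_name(0))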
2025-05-07T20:30:03.1161436Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest 2025-05-07T20:30:04.2286530Z Channels: 2025-05-07T20:30:04.2286768Z - conda-forge 2025-05-07T20:30:04.2287005Z Platform: linux-64 2025-05-07T20:30:07.5537151Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:30:08.6978878Z Solving environment: \ | / done 2025-05-07T20:30:08.9444216Z 2025-05-07T20:30:08.9444487Z ## Package Plan ## 2025-05-07T20:30:08.9444658Z 2025-05-07T20:30:08.9444867Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:30:08.9445165Z 2025-05-07T20:30:08.9445272Z added / updated specs: 2025-05-07T20:30:08.9445525Z - expecttest 2025-05-07T20:30:08.9445755Z - pytest 2025-05-07T20:30:08.9445879Z 2025-05-07T20:30:08.9445893Z 2025-05-07T20:30:08.9446015Z The following packages will be downloaded: 2025-05-07T20:30:08.9446254Z 2025-05-07T20:30:08.9446379Z package | build 2025-05-07T20:30:08.9446692Z ---------------------------|----------------- 2025-05-07T20:30:08.9447062Z colorama-0.4.6 | pyhd8ed1ab_1 26 KB conda-forge 2025-05-07T20:30:08.9447586Z exceptiongroup-1.2.2 | pyhd8ed1ab_1 20 KB conda-forge 2025-05-07T20:30:08.9448046Z expecttest-0.3.0 | pyhd8ed1ab_0 14 KB conda-forge 2025-05-07T20:30:08.9448499Z iniconfig-2.0.0 | pyhd8ed1ab_1 11 KB conda-forge 2025-05-07T20:30:08.9448940Z packaging-25.0 | pyh29332c3_1 61 KB conda-forge 2025-05-07T20:30:08.9449359Z pluggy-1.5.0 | pyhd8ed1ab_1 23 KB conda-forge 2025-05-07T20:30:08.9449763Z pytest-8.3.5 | pyhd8ed1ab_0 254 KB conda-forge 2025-05-07T20:30:08.9450508Z tomli-2.2.1 | pyhd8ed1ab_1 19 KB conda-forge 2025-05-07T20:30:08.9450899Z ------------------------------------------------------------ 2025-05-07T20:30:08.9451236Z Total: 428 KB 2025-05-07T20:30:08.9451441Z 2025-05-07T20:30:08.9451566Z The following NEW packages will be INSTALLED: 2025-05-07T20:30:08.9451788Z 2025-05-07T20:30:08.9451990Z colorama conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1 2025-05-07T20:30:08.9452487Z exceptiongroup conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1 2025-05-07T20:30:08.9453009Z expecttest conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0 2025-05-07T20:30:08.9453477Z iniconfig conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1 2025-05-07T20:30:08.9453937Z packaging conda-forge/noarch::packaging-25.0-pyh29332c3_1 2025-05-07T20:30:08.9454381Z pluggy conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1 2025-05-07T20:30:08.9454808Z pytest conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0 2025-05-07T20:30:08.9455216Z tomli conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1 2025-05-07T20:30:08.9455476Z 2025-05-07T20:30:08.9455479Z 2025-05-07T20:30:08.9455483Z 2025-05-07T20:30:08.9455622Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:30:08.9455985Z [conda download progress bars elided: pytest-8.3.5 (254 KB), packaging-25.0 (61 KB), colorama-0.4.6 (26 KB), pluggy-1.5.0 (23 KB), exceptiongroup-1.2.2 (20 KB), tomli-2.2.1 (19 KB), expecttest-0.3.0 (14 KB), and iniconfig-2.0.0 (11 KB) each reached 100%] done 2025-05-07T20:30:09.5491191Z Preparing transaction: done 2025-05-07T20:30:09.6496008Z Verifying transaction: done 2025-05-07T20:30:11.5522083Z Executing transaction: done 2025-05-07T20:30:11.6837607Z [TEST] Checking imports ... 2025-05-07T20:30:15.5818114Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:30:15.5830851Z [TEST] Setting feature flags ...
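The feature flag set in the next step is an ordinary environment variable scoped to the conda env, which FBGEMM_GPU code can then read at runtime. A sketch of how code under test might observe it (the flag name is taken from the command below):

    import os

    # `conda env config vars set` exports this into every `conda run` shell
    enabled = os.environ.get("FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD") == "1"
    print("ensemble rowwise adagrad enabled:", enabled)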
2025-05-07T20:30:15.5831451Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:30:15.5831913Z 2025-05-07T20:30:16.0039441Z 2025-05-07T20:30:16.0039986Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:30:16.0041187Z ################################################################################ 2025-05-07T20:30:16.0041662Z # Run FBGEMM-GPU Tests: 2025-05-07T20:30:16.0041994Z # 2025-05-07T20:30:16.0061343Z # [2025-05-07T20:30:16.005Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:30:16.0061922Z ################################################################################ 2025-05-07T20:30:16.0062210Z 2025-05-07T20:30:16.0068982Z [TEST] Enumerating ALL test files ... 2025-05-07T20:30:16.0097409Z ./attention/gqa_test.py 2025-05-07T20:30:16.0097776Z ./coalesce/coalesce_test.py 2025-05-07T20:30:16.0098430Z ./comm/multi_gpu_car_test.py 2025-05-07T20:30:16.0098809Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:30:16.0099173Z ./kv_cache/kv_cache_test.py 2025-05-07T20:30:16.0099432Z ./moe/activation_test.py 2025-05-07T20:30:16.0099685Z ./moe/gather_scatter_test.py 2025-05-07T20:30:16.0099932Z ./moe/layers_test.py 2025-05-07T20:30:16.0100162Z ./moe/shuffling_test.py 2025-05-07T20:30:16.0100404Z ./quantize/quantize_test.py 2025-05-07T20:30:16.0100566Z 2025-05-07T20:30:16.0100678Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:30:16.0100896Z 2025-05-07T20:30:16.0118080Z ################################################################################ 2025-05-07T20:30:16.0133348Z # [2025-05-07T20:30:16.013Z] Run Python Test Suite: 2025-05-07T20:30:16.0133808Z # ./attention/gqa_test.py 2025-05-07T20:30:16.0134169Z ################################################################################ 2025-05-07T20:30:16.0157842Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:30:16.0158602Z 2025-05-07T20:30:18.5123110Z ============================= test session starts ============================== 2025-05-07T20:30:18.5124167Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:18.5125062Z cachedir: .pytest_cache 2025-05-07T20:30:18.5126346Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:18.5127752Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:18.5128433Z plugins: hypothesis-6.131.14 2025-05-07T20:30:20.2352422Z collecting ... 
collected 2 items

attention/gqa_test.py::Int4GQATest::test_gqa
Trying example: test_gqa(int4_kv=False, num_groups=1, B=1, MAX_T=4, N_H_L=1)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=1, MAX_T=4, N_H_L=1)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=23, MAX_T=33, N_H_L=68)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=77, MAX_T=4, N_H_L=1)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=77, MAX_T=52, N_H_L=67)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=57, MAX_T=45, N_H_L=120)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=52, MAX_T=42, N_H_L=53)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=77, MAX_T=95, N_H_L=53)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=113, MAX_T=48, N_H_L=96)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=51, MAX_T=61, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=17, MAX_T=113, N_H_L=65)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=17, MAX_T=65, N_H_L=65)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=65, MAX_T=65, N_H_L=65)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=108, N_H_L=14)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=14, N_H_L=14)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=6, N_H_L=14)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=6, N_H_L=6)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=70, MAX_T=94, N_H_L=78)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=78, MAX_T=94, N_H_L=78)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=94, MAX_T=94, N_H_L=78)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=94, MAX_T=94, N_H_L=94)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=41, MAX_T=105, N_H_L=126)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=105, MAX_T=105, N_H_L=126)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=105, MAX_T=105, N_H_L=105)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=95, MAX_T=114, N_H_L=43)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=43, MAX_T=114, N_H_L=43)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=43, MAX_T=43, N_H_L=43)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=21, MAX_T=38, N_H_L=42)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=38, MAX_T=38, N_H_L=42)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=38, MAX_T=42, N_H_L=42)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=42, MAX_T=42, N_H_L=42)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=74, MAX_T=20, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=20, MAX_T=20, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=20, MAX_T=15, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=15, MAX_T=20, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=15, MAX_T=15, N_H_L=15)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=104, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=117, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=69, MAX_T=117, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=69, N_H_L=69)
PASSED
attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...)
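The "Trying example" lines above are Hypothesis's verbose-mode output: one line per generated parameter set for the property-based test. As a rough sketch of the pattern that produces this output — the strategies and ranges below are illustrative placeholders, not the actual ones from attention/gqa_test.py:

    import unittest

    import hypothesis.strategies as st
    from hypothesis import Verbosity, given, settings

    class GQATestSketch(unittest.TestCase):
        @given(
            int4_kv=st.booleans(),
            num_groups=st.sampled_from([1, 4]),
            B=st.integers(min_value=1, max_value=128),
            MAX_T=st.integers(min_value=4, max_value=128),
            N_H_L=st.integers(min_value=1, max_value=128),
        )
        @settings(verbosity=Verbosity.verbose, deadline=None)
        def test_gqa(self, int4_kv, num_groups, B, MAX_T, N_H_L):
            # Verbosity.verbose makes Hypothesis print "Trying example: ..."
            # before each run; the real GQA kernel-vs-reference check is
            # elided here.
            self.assertGreater(B * MAX_T * N_H_L, 0)

With a "ci" profile like the one in the session header (derandomize=True, deadline=None), the drawn examples are reproducible across runs.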
2025-05-07T20:30:58.2795835Z 2025-05-07T20:30:58.2796050Z =========================== short test summary info ============================ 2025-05-07T20:30:58.2797105Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/unittest/case.py:153: Skip when CUDA is not available or xformers is not available 2025-05-07T20:30:58.2798381Z ======================== 1 passed, 1 skipped in 40.24s ========================= 2025-05-07T20:30:58.9330533Z 2025-05-07T20:30:58.9331095Z [TEST] Python test suite PASSED: ./attention/gqa_test.py 2025-05-07T20:30:58.9351980Z [TEST] Python test time for ./attention/gqa_test.py: 42 seconds 2025-05-07T20:30:58.9352255Z 2025-05-07T20:30:58.9352265Z 2025-05-07T20:30:58.9352269Z 2025-05-07T20:30:58.9352333Z 2025-05-07T20:30:58.9372707Z ################################################################################ 2025-05-07T20:30:58.9387967Z # [2025-05-07T20:30:58.938Z] Run Python Test Suite: 2025-05-07T20:30:58.9388309Z # ./coalesce/coalesce_test.py 2025-05-07T20:30:58.9388603Z ################################################################################ 2025-05-07T20:30:58.9414362Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py 2025-05-07T20:30:58.9414975Z 2025-05-07T20:31:01.0803012Z ============================= test session starts ============================== 2025-05-07T20:31:01.0803677Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:01.0804205Z cachedir: .pytest_cache 2025-05-07T20:31:01.0804781Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:01.0805514Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:01.0806329Z plugins: hypothesis-6.131.14 2025-05-07T20:31:02.7497889Z collecting ... 
collected 1 item 2025-05-07T20:31:02.7498132Z 2025-05-07T20:31:03.4745707Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED 2025-05-07T20:31:03.4746053Z 2025-05-07T20:31:03.4746217Z ============================== 1 passed in 2.51s =============================== 2025-05-07T20:31:04.1194604Z 2025-05-07T20:31:04.1194888Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py 2025-05-07T20:31:04.1218014Z [TEST] Python test time for ./coalesce/coalesce_test.py: 6 seconds 2025-05-07T20:31:04.1218308Z 2025-05-07T20:31:04.1218313Z 2025-05-07T20:31:04.1218317Z 2025-05-07T20:31:04.1218321Z 2025-05-07T20:31:04.1238404Z ################################################################################ 2025-05-07T20:31:04.1253697Z # [2025-05-07T20:31:04.125Z] Run Python Test Suite: 2025-05-07T20:31:04.1254039Z # ./comm/multi_gpu_car_test.py 2025-05-07T20:31:04.1254583Z ################################################################################ 2025-05-07T20:31:04.1279111Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py 2025-05-07T20:31:04.1279744Z 2025-05-07T20:31:06.2531653Z ============================= test session starts ============================== 2025-05-07T20:31:06.2532384Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:06.2532913Z cachedir: .pytest_cache 2025-05-07T20:31:06.2533495Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:06.2534232Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:06.2534648Z plugins: hypothesis-6.131.14 2025-05-07T20:31:07.9550448Z collecting ... 
collected 5 items 2025-05-07T20:31:07.9550726Z 2025-05-07T20:31:07.9560071Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED 2025-05-07T20:31:07.9578699Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED 2025-05-07T20:31:07.9585817Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED 2025-05-07T20:31:07.9592425Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED 2025-05-07T20:31:07.9608465Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED 2025-05-07T20:31:07.9608930Z 2025-05-07T20:31:07.9609140Z =========================== short test summary info ============================ 2025-05-07T20:31:07.9609836Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:07.9610758Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:07.9611682Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:07.9612591Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:07.9613519Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:07.9614165Z ============================== 5 skipped in 1.82s ============================== 2025-05-07T20:31:08.5245651Z 2025-05-07T20:31:08.5246009Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py 2025-05-07T20:31:08.5267234Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 4 seconds 2025-05-07T20:31:08.5267652Z 2025-05-07T20:31:08.5267658Z 2025-05-07T20:31:08.5267664Z 2025-05-07T20:31:08.5267678Z 2025-05-07T20:31:08.5289013Z ################################################################################ 2025-05-07T20:31:08.5304517Z # [2025-05-07T20:31:08.530Z] Run Python Test Suite: 2025-05-07T20:31:08.5304972Z # ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:08.5305290Z ################################################################################ 2025-05-07T20:31:08.5329529Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:08.5330182Z 2025-05-07T20:31:10.6632463Z ============================= test session starts ============================== 2025-05-07T20:31:10.6633701Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:10.6634730Z cachedir: .pytest_cache 2025-05-07T20:31:10.6635878Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:10.6637829Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:10.6638634Z plugins: hypothesis-6.131.14 2025-05-07T20:31:12.4831122Z collecting ... 
collected 2 items 2025-05-07T20:31:12.4831361Z 2025-05-07T20:31:12.4840158Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED 2025-05-07T20:31:12.4854542Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED 2025-05-07T20:31:12.4855011Z 2025-05-07T20:31:12.4855161Z =========================== short test summary info ============================ 2025-05-07T20:31:12.4855785Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:12.4856618Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:12.4857216Z ============================== 2 skipped in 1.94s ============================== 2025-05-07T20:31:13.0745553Z 2025-05-07T20:31:13.0745953Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:13.0766672Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 5 seconds 2025-05-07T20:31:13.0767137Z 2025-05-07T20:31:13.0767143Z 2025-05-07T20:31:13.0767149Z 2025-05-07T20:31:13.0767154Z 2025-05-07T20:31:13.0788698Z ################################################################################ 2025-05-07T20:31:13.0804430Z # [2025-05-07T20:31:13.080Z] Run Python Test Suite: 2025-05-07T20:31:13.0804905Z # ./kv_cache/kv_cache_test.py 2025-05-07T20:31:13.0805304Z ################################################################################ 2025-05-07T20:31:13.0829499Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py 2025-05-07T20:31:13.0830237Z 2025-05-07T20:31:15.2105293Z ============================= test session starts ============================== 2025-05-07T20:31:15.2106217Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:15.2106744Z cachedir: .pytest_cache 2025-05-07T20:31:15.2107335Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:15.2108094Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:15.2108506Z plugins: hypothesis-6.131.14 2025-05-07T20:31:16.9521015Z collecting ... collected 4 items 2025-05-07T20:31:16.9521249Z 2025-05-07T20:31:19.6839856Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...) 
2025-05-07T20:31:19.6960375Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED 2025-05-07T20:31:19.7104454Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED 2025-05-07T20:31:19.7228969Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED 2025-05-07T20:31:19.7229403Z 2025-05-07T20:31:19.7229557Z =========================== short test summary info ============================ 2025-05-07T20:31:19.7230266Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/unittest/case.py:153: Skip when H100 is not available or MI300 is not available 2025-05-07T20:31:19.7231197Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/unittest/case.py:153: Skip when xformers is not available 2025-05-07T20:31:19.7231805Z ============================== 4 skipped in 4.63s ============================== 2025-05-07T20:31:21.6094861Z 2025-05-07T20:31:21.6095463Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py 2025-05-07T20:31:21.6116138Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 8 seconds 2025-05-07T20:31:21.6116440Z 2025-05-07T20:31:21.6116445Z 2025-05-07T20:31:21.6117510Z 2025-05-07T20:31:21.6117515Z 2025-05-07T20:31:21.6138710Z ################################################################################ 2025-05-07T20:31:21.6154381Z # [2025-05-07T20:31:21.615Z] Run Python Test Suite: 2025-05-07T20:31:21.6154740Z # ./moe/activation_test.py 2025-05-07T20:31:21.6155029Z ################################################################################ 2025-05-07T20:31:21.6179238Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py 2025-05-07T20:31:21.6179849Z 2025-05-07T20:31:23.7526148Z ============================= test session starts ============================== 2025-05-07T20:31:23.7526796Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:23.7527317Z cachedir: .pytest_cache 2025-05-07T20:31:23.7527972Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:23.7528734Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:23.7529139Z plugins: hypothesis-6.131.14 2025-05-07T20:31:25.3786148Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:25.5279640Z collecting ... 
collected 2 items

moe/activation_test.py::ActivationTests::test_silu_mul
Trying example: test_silu_mul(T=1, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=4096, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=4096, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=4096, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=1, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=1, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=4096, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=1, D=7168, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=2048, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=2048, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=2048, D=7168, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=128, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=16384, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=16384, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=128, D=7168, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=1, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=1, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=4096, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=1, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=4096, D=7168, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=4096, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=128, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=1, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=2048, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=2048, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=16384, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=16384, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=4096, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=2048, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=2048, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=16384, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=2048, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=16384, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=16384, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=16384, D=7168, contiguous=True, compiled=False)
PASSED
W0507 20:31:30.927000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
Traceback (most recent call last):
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
    ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
    ttir_module = src.make_ir(options, codegen_fns, module_map, context)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
    generator.visit(fn.parse())
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit
    ret = super().visit(node)
          ^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit
    return visitor(node)
           ^^^^^^^^^^^^^
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit
    self.visit(item)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit
    raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
triton.compiler.errors.CompilationError: at 1:0:
def _fbgemm_silu_mul_quant(
^
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
moe/activation_test.py::ActivationTests::test_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)

self = <activation_test.ActivationTests testMethod=test_silu_mul_quant>
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:31.3987781Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:31.3988110Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:31.3988907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:31.3989664Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:31.3990308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:31.3990998Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:31.3991694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:31.3992416Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:31.3993173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:31.3993931Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:31.3994660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:31.3995293Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:31.3995898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:31.3996420Z fn() 2025-05-07T20:31:31.3996927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:31.3997508Z self.fn.run( 2025-05-07T20:31:31.3997978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:31.3998620Z kernel = self.compile( 2025-05-07T20:31:31.3999157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:31.3999809Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:31.4000207Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:31.4000439Z 2025-05-07T20:31:31.4000651Z self = 2025-05-07T20:31:31.4001734Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:31.4003146Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51f3937380>} 2025-05-07T20:31:31.4004509Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:31.4005548Z context = 2025-05-07T20:31:31.4006175Z 2025-05-07T20:31:31.4006355Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:31.4006881Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:31.4007361Z module_map=module_map) 2025-05-07T20:31:31.4007796Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:31.4008149Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:31.4008427Z E ^ 2025-05-07T20:31:31.4008897Z E ValueError("type fp8e4nv not supported in this architecture. 
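Every failure in this run has the same root cause: the kernels cast their output to fp8e4nv, Triton's name for the float8_e4m3fn ("NV" e4m3) format, and Triton can only lower that cast on NVIDIA GPUs with compute capability 8.9 or newer (Ada and Hopper). On older architectures the compiler offers only fp8e4b15 (e4m3 with bias 15) and fp8e5 (e5m2), exactly as the ValueError reports. A minimal guard sketch for a unittest-style test like the one above; the helper name is ours and the 8.9 threshold is an assumption to verify against your Triton/CUDA stack:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+ (e.g. L4, L40S, H100);
        # pre-Ada GPUs only get fp8e4b15 and fp8e5 in Triton.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Usage on a test method or class:
    # @unittest.skipUnless(supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")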
2025-05-07T20:31:31.4010405Z Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
2025-05-07T20:31:31.7364873Z W0507 20:31:31.733000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
Traceback (most recent call last):
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
    ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
    ttir_module = src.make_ir(options, codegen_fns, module_map, context)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
    generator.visit(fn.parse())
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit
    ret = super().visit(node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit
    return visitor(node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit
    self.visit(item)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit
    raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
triton.compiler.errors.CompilationError: at 1:0:
def _fbgemm_silu_mul_quant(
^
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[The same identify_mutated_tensors warning, with an identical CompilationError traceback for _fbgemm_silu_mul_quant, is logged three more times at 20:31:31.829, 20:31:32.095, and 20:31:32.109.]
2025-05-07T20:31:32.4259108Z self =
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    [test body identical to the @given listing above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51f3756660>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:31:32.4288883Z Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)

self =
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    [test body identical to the @given listing above]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
[jit.py:330 / autotuner.py:186,166 / testing.py:117 / autotuner.py:152 / jit.py:623 / compiler.py:273 frames identical to the first traceback above]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51f38cbce0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
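Both paths die on the same cast: the fused kernel (_fbgemm_silu_mul_quant) and the rowwise quantizer (_kernel_quantize_fp8_row) each try to produce fp8e4nv output. For reference, an eager-PyTorch sketch of the computation under test, mirroring ref_fn above. The scale convention follows the test's own dequantization check (y_fp8.to(torch.float32) * y_scale[:, None]); the scale_ub clamping and the eps floor are assumptions rather than fbgemm's exact semantics, and torch >= 2.1 is assumed for float8_e4m3fn:

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_eager(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
        eps: float = 1e-12,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1, computed in fp32 as in ref_fn above.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1)                    # per-row absolute max
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)   # assumed clamp semantics
        fp8_max = torch.finfo(torch.float8_e4m3fn).max   # 448.0 for e4m3fn
        scale = row_max.clamp(min=eps) / fp8_max         # per-row dequant scale
        y_q = (y / scale[:, None]).clamp(-fp8_max, fp8_max)
        return y_q.to(torch.float8_e4m3fn), scale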
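The W0507 lines interleaved above and below are warnings, not the test failure itself: when torch.compile encounters a user-defined Triton kernel, it first compiles the kernel to Triton IR to work out which tensor arguments the kernel writes to; here that compilation raises the same fp8e4nv error, so it falls back to assuming every input is mutated and logs the traceback. A conceptual sketch of that fallback, with stand-in helper names passed as parameters (the real logic lives in torch/_higher_order_ops/triton_kernel_wrap.py):

    import logging
    from typing import Any, Callable, Dict, List

    import torch

    log = logging.getLogger(__name__)

    def identify_mutated_tensors_sketch(
        compile_to_ttir: Callable[..., Any],
        mutated_args_from_ttir: Callable[[Any], List[str]],
        kernel_kwargs: Dict[str, Any],
    ) -> List[str]:
        try:
            # Build the kernel's Triton IR and analyze which args it stores to.
            ttir = compile_to_ttir(**kernel_kwargs)
            return mutated_args_from_ttir(ttir)
        except Exception:
            # IR generation failed (here: unsupported fp8e4nv cast), so take the
            # conservative answer: every tensor argument may be mutated.
            log.warning(
                "Encountered an exception in identify_mutated_tensors, "
                "assuming every input is mutated",
                exc_info=True,
            )
            return [k for k, v in kernel_kwargs.items() if isinstance(v, torch.Tensor)]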
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:32.4319226Z kernel = self.compile( 2025-05-07T20:31:32.4319779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:32.4320441Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.4320843Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:32.4321076Z 2025-05-07T20:31:32.4321288Z self = 2025-05-07T20:31:32.4322376Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:32.4323913Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51f38cbce0>} 2025-05-07T20:31:32.4325260Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:32.4326289Z context = 2025-05-07T20:31:32.4326578Z 2025-05-07T20:31:32.4326746Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:32.4327276Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.4327815Z module_map=module_map) 2025-05-07T20:31:32.4328191Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.4328558Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:32.4328836Z E ^ 2025-05-07T20:31:32.4329308Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:32.4329757Z 2025-05-07T20:31:32.4330285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:32.4330806Z 2025-05-07T20:31:32.4330915Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:32.4331335Z self=, 2025-05-07T20:31:32.4331744Z T=16384, 2025-05-07T20:31:32.4331943Z D=7168, 2025-05-07T20:31:32.4332150Z scale_ub=1200.0, 2025-05-07T20:31:32.4332384Z contiguous=False, 2025-05-07T20:31:32.4332616Z compiled=False, 2025-05-07T20:31:32.4332830Z ) 2025-05-07T20:31:32.6685442Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:32.6686511Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:31:32.6688217Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:32.6691032Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:32.6692972Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.6695578Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 
2025-05-07T20:31:32.6697962Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.6698950Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.6700180Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:32.6701548Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.6702775Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.6704056Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:32.6705291Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:31:32.6706672Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:32.6707881Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:31:32.6708705Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.6709856Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:32.6710865Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:31:32.6711655Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:32.6712855Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:32.6714136Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:32.6715252Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:32.6716288Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:31:32.6717453Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:32.6718803Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:32.6719867Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.6720773Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:32.6721510Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:32.6722528Z W0507 20:31:32.666000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:32.7376234Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:32.7379007Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:31:32.7381648Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:32.7384466Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:32.7386396Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.7388061Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:32.7389442Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.7390535Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.7391760Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:32.7393121Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.7394184Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.7395458Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:32.7396702Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:31:32.7397919Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:32.7399114Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:31:32.7399944Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.7400963Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:32.7401983Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:31:32.7402767Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:32.7403977Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:32.7405335Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:32.7406610Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:32.7407699Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:31:32.7408861Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:32.7410211Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:32.7411271Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.7412179Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:32.7412920Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:32.7414046Z W0507 20:31:32.735000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:33.1315051Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:33.1317531Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last):
2025-05-07T20:31:33.1318883Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:33.1320318Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:33.1321303Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:33.1322620Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:33.1324013Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:33.1324997Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:33.1326228Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:33.1327711Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:33.1328787Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:33.1330247Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:33.1331498Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     generator.visit(fn.parse())
2025-05-07T20:31:33.1332725Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:33.1333937Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ret = super().visit(node)
2025-05-07T20:31:33.1334762Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:33.1335795Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit
2025-05-07T20:31:33.1336819Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     return visitor(node)
2025-05-07T20:31:33.1337756Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ^^^^^^^^^^^^^
2025-05-07T20:31:33.1338963Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:33.1340249Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:33.1341372Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit
2025-05-07T20:31:33.1342417Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     self.visit(item)
2025-05-07T20:31:33.1343600Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:33.1344951Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:33.1346016Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:33.1346934Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:33.1347728Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^
2025-05-07T20:31:33.1348746Z W0507 20:31:33.129000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:33.9331145Z self =
2025-05-07T20:31:33.9332305Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:31:33.9332618Z
2025-05-07T20:31:33.9332704Z     @given(
2025-05-07T20:31:33.9332949Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:33.9333260Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:33.9333575Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:33.9333909Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:33.9334241Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:33.9334530Z     )
2025-05-07T20:31:33.9334883Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:33.9335331Z     def test_silu_mul_quant(
2025-05-07T20:31:33.9335571Z         self,
2025-05-07T20:31:33.9335768Z         T: int,
2025-05-07T20:31:33.9335968Z         D: int,
2025-05-07T20:31:33.9336183Z         scale_ub: Optional[float],
2025-05-07T20:31:33.9336459Z         contiguous: bool,
2025-05-07T20:31:33.9336706Z         compiled: bool,
2025-05-07T20:31:33.9336929Z     ) -> None:
2025-05-07T20:31:33.9337150Z         torch.manual_seed(2025)
2025-05-07T20:31:33.9337393Z
2025-05-07T20:31:33.9337665Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:33.9338015Z
2025-05-07T20:31:33.9338213Z         x_sign = torch.sign(x)
2025-05-07T20:31:33.9338501Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:33.9338821Z         x = x_sign * x_clamp
2025-05-07T20:31:33.9339064Z         x0 = x[:, :D]
2025-05-07T20:31:33.9339277Z         x1 = x[:, D:]
2025-05-07T20:31:33.9339485Z
2025-05-07T20:31:33.9339672Z         if contiguous:
2025-05-07T20:31:33.9339898Z             x0 = x0.contiguous()
2025-05-07T20:31:33.9340158Z             x1 = x1.contiguous()
2025-05-07T20:31:33.9340401Z
2025-05-07T20:31:33.9340591Z         if scale_ub is not None:
2025-05-07T20:31:33.9340867Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:33.9341214Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:33.9341525Z             )
2025-05-07T20:31:33.9341713Z         else:
2025-05-07T20:31:33.9341928Z             scale_ub_tensor = None
2025-05-07T20:31:33.9342182Z
2025-05-07T20:31:33.9342413Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:33.9342726Z             op = silu_mul_quant
2025-05-07T20:31:33.9342978Z             if compiled:
2025-05-07T20:31:33.9343392Z                 op = torch.compile(op)
2025-05-07T20:31:33.9343689Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:33.9343967Z
2025-05-07T20:31:33.9344154Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:33.9344326Z
2025-05-07T20:31:33.9344429Z moe/activation_test.py:117:
2025-05-07T20:31:33.9344729Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:33.9345067Z moe/activation_test.py:115: in fn
2025-05-07T20:31:33.9345346Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:33.9346051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:33.9346752Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:33.9347288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:31:33.9347975Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:33.9348646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:33.9349181Z     kernel = self.compile(
2025-05-07T20:31:33.9349723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:33.9350381Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:33.9350859Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:33.9351088Z
2025-05-07T20:31:33.9351293Z self =
2025-05-07T20:31:33.9352375Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:33.9354051Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51f3e0c720>}
2025-05-07T20:31:33.9355406Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:33.9356432Z context =
2025-05-07T20:31:33.9356720Z
2025-05-07T20:31:33.9356900Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:33.9357422Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:33.9357914Z                            module_map=module_map)
2025-05-07T20:31:33.9358326Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:33.9358678Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:33.9358948Z E       ^
2025-05-07T20:31:33.9359553Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:33.9360008Z
2025-05-07T20:31:33.9360433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:33.9360948Z
2025-05-07T20:31:33.9361057Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:33.9361476Z     self=,
2025-05-07T20:31:33.9361892Z     T=1,
2025-05-07T20:31:33.9362079Z     D=7168,
2025-05-07T20:31:33.9362287Z     scale_ub=None,
2025-05-07T20:31:33.9362510Z     contiguous=True,
2025-05-07T20:31:33.9362732Z     compiled=True,
2025-05-07T20:31:33.9362950Z )
2025-05-07T20:31:33.9363278Z self =
2025-05-07T20:31:33.9363770Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:33.9364029Z
2025-05-07T20:31:33.9375944Z         y_fp8, y_scale = fn()
2025-05-07T20:31:33.9376235Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:33.9376525Z
2025-05-07T20:31:33.9376768Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:33.9377109Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:33.9377400Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:33.9377721Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:33.9378084Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:33.9378407Z
2025-05-07T20:31:33.9378610Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:33.9378819Z
2025-05-07T20:31:33.9378918Z moe/activation_test.py:126:
2025-05-07T20:31:33.9379220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:33.9379557Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:33.9379885Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:33.9380679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:33.9381436Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:33.9381976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:31:33.9382664Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:33.9383618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:33.9384498Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:33.9385257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in
2025-05-07T20:31:33.9386004Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:33.9386743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:33.9387377Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:33.9387994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:33.9388559Z     fn()
2025-05-07T20:31:33.9389069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:33.9389653Z     self.fn.run(
2025-05-07T20:31:33.9390123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:33.9390657Z     kernel = self.compile(
2025-05-07T20:31:33.9391194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:33.9391980Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:33.9392385Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:33.9392612Z
2025-05-07T20:31:33.9392827Z self =
2025-05-07T20:31:33.9393908Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:33.9395279Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51f3e0e2a0>}
2025-05-07T20:31:33.9396616Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:33.9397636Z context =
2025-05-07T20:31:33.9397921Z
2025-05-07T20:31:33.9398085Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:33.9398604Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:33.9399069Z                            module_map=module_map)
2025-05-07T20:31:33.9399431Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:33.9399785Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:33.9400053Z E       ^
2025-05-07T20:31:33.9400511Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:33.9400956Z
2025-05-07T20:31:33.9401375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:33.9401883Z
2025-05-07T20:31:33.9401987Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:33.9402408Z     self=,
2025-05-07T20:31:33.9402812Z     T=4096,
2025-05-07T20:31:33.9402994Z     D=5120,
2025-05-07T20:31:33.9403187Z     scale_ub=None,
2025-05-07T20:31:33.9403408Z     contiguous=False,
2025-05-07T20:31:33.9403629Z     compiled=False,
2025-05-07T20:31:33.9403839Z )
2025-05-07T20:31:34.2771486Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:34.2773092Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last):
2025-05-07T20:31:34.2774441Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:34.2775865Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:34.2776841Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:34.2778148Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:34.2779534Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:34.2780641Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:34.2781862Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:34.2783237Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:34.2784306Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:34.2785594Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:34.2786837Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     generator.visit(fn.parse())
2025-05-07T20:31:34.2788051Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:34.2789255Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ret = super().visit(node)
2025-05-07T20:31:34.2790089Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:34.2791108Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit
2025-05-07T20:31:34.2792128Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     return visitor(node)
2025-05-07T20:31:34.2792911Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ^^^^^^^^^^^^^
2025-05-07T20:31:34.2794115Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:34.2795487Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:34.2796607Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit
2025-05-07T20:31:34.2797650Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     self.visit(item)
2025-05-07T20:31:34.2798884Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:34.2800241Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:34.2801313Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:34.2802230Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:34.2802965Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^
2025-05-07T20:31:34.2804067Z W0507 20:31:34.274000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:36.6220226Z self =
2025-05-07T20:31:36.6220814Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:31:36.6221099Z
2025-05-07T20:31:36.6233360Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:36.6233522Z
2025-05-07T20:31:36.6233628Z moe/activation_test.py:117:
2025-05-07T20:31:36.6233921Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:36.6234257Z moe/activation_test.py:115: in fn
2025-05-07T20:31:36.6234540Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:36.6235239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:36.6235927Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:36.6244391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:31:36.6245092Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:36.6245763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:36.6246318Z     kernel = self.compile(
2025-05-07T20:31:36.6246866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:36.6247598Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:36.6248005Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:36.6248241Z
2025-05-07T20:31:36.6248467Z self =
2025-05-07T20:31:36.6249548Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:36.6250945Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51f2b454e0>}
2025-05-07T20:31:36.6252416Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:36.6253448Z context =
2025-05-07T20:31:36.6253735Z
2025-05-07T20:31:36.6253901Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:36.6254435Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:36.6254909Z                            module_map=module_map)
2025-05-07T20:31:36.6255282Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:36.6255639Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:36.6255910Z E       ^
2025-05-07T20:31:36.6256388Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:36.6256850Z
2025-05-07T20:31:36.6257270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:36.6257791Z
2025-05-07T20:31:36.6257900Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:36.6258325Z     self=,
2025-05-07T20:31:36.6258740Z     T=4096,
2025-05-07T20:31:36.6258933Z     D=7168,
2025-05-07T20:31:36.6259136Z     scale_ub=None,
2025-05-07T20:31:36.6259473Z     contiguous=False,
2025-05-07T20:31:36.6259706Z     compiled=False,
2025-05-07T20:31:36.6259926Z )
2025-05-07T20:31:36.6260254Z self =
2025-05-07T20:31:36.6260751Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:31:36.6261033Z
2025-05-07T20:31:36.6272920Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:36.6273086Z
2025-05-07T20:31:36.6273190Z moe/activation_test.py:117:
2025-05-07T20:31:36.6273504Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:36.6273853Z moe/activation_test.py:115: in fn
2025-05-07T20:31:36.6274140Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:36.6274842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:36.6275546Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:36.6276094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:31:36.6276785Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:36.6277470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:36.6278016Z     kernel = self.compile(
2025-05-07T20:31:36.6278637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:36.6279343Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:36.6279761Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:36.6279993Z
2025-05-07T20:31:36.6280212Z self =
2025-05-07T20:31:36.6281292Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:36.6282672Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51f2b45580>}
2025-05-07T20:31:36.6284030Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:36.6285054Z context =
2025-05-07T20:31:36.6285342Z
2025-05-07T20:31:36.6285513Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:36.6286032Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:36.6286504Z                            module_map=module_map)
2025-05-07T20:31:36.6286882Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:36.6287241Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:36.6287574Z E       ^
2025-05-07T20:31:36.6288053Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:36.6288506Z
2025-05-07T20:31:36.6288927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:36.6289442Z
2025-05-07T20:31:36.6289556Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:36.6289964Z     self=,
2025-05-07T20:31:36.6290377Z     T=128,
2025-05-07T20:31:36.6290573Z     D=7168,
2025-05-07T20:31:36.6290765Z     scale_ub=None,
2025-05-07T20:31:36.6290988Z     contiguous=False,
2025-05-07T20:31:36.6291219Z     compiled=True,
2025-05-07T20:31:36.6291423Z )
2025-05-07T20:31:36.6752260Z self =
2025-05-07T20:31:36.6754023Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:31:36.6754770Z
2025-05-07T20:31:36.6768387Z         y_fp8, y_scale = fn()
2025-05-07T20:31:36.6768675Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:36.6768974Z
2025-05-07T20:31:36.6769211Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:36.6769557Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:36.6769857Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:36.6770177Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:36.6770535Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:36.6770854Z
2025-05-07T20:31:36.6771062Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:36.6771258Z
2025-05-07T20:31:36.6771360Z moe/activation_test.py:126:
2025-05-07T20:31:36.6771666Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:36.6772003Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:36.6772326Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:36.6773118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:36.6773874Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:36.6774419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:31:36.6775191Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:36.6775879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:36.6776602Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:36.6777359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in
2025-05-07T20:31:36.6778100Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:36.6778831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:36.6779471Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:36.6780072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:36.6780593Z     fn()
2025-05-07T20:31:36.6781105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:36.6781690Z     self.fn.run(
2025-05-07T20:31:36.6782151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:36.6782682Z     kernel = self.compile(
2025-05-07T20:31:36.6783301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:36.6783959Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:36.6784351Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:36.6784588Z
2025-05-07T20:31:36.6784796Z self =
2025-05-07T20:31:36.6785888Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:36.6787259Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51f2b46c00>}
2025-05-07T20:31:36.6788652Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:36.6789678Z context =
2025-05-07T20:31:36.6789973Z
2025-05-07T20:31:36.6790140Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:36.6790670Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:36.6791138Z                            module_map=module_map)
2025-05-07T20:31:36.6791506Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:36.6791869Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:36.6792137Z E       ^
2025-05-07T20:31:36.6792605Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:36.6793060Z
2025-05-07T20:31:36.6793486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:36.6793996Z
2025-05-07T20:31:36.6794109Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:36.6794517Z     self=,
2025-05-07T20:31:36.6794925Z     T=128,
2025-05-07T20:31:36.6795124Z     D=7168,
2025-05-07T20:31:36.6795318Z     scale_ub=None,
2025-05-07T20:31:36.6795541Z     contiguous=False,
2025-05-07T20:31:36.6795863Z     compiled=False,
2025-05-07T20:31:36.6796067Z )
2025-05-07T20:31:36.8313387Z self =
2025-05-07T20:31:36.8314905Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:31:36.8315648Z
2025-05-07T20:31:36.8328899Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:36.8329062Z
2025-05-07T20:31:36.8329163Z moe/activation_test.py:117:
2025-05-07T20:31:36.8329464Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:36.8329810Z moe/activation_test.py:115: in fn
2025-05-07T20:31:36.8330094Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:36.8330777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:36.8331467Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:36.8332010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:31:36.8332688Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:36.8333351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:36.8333889Z     kernel = self.compile(
2025-05-07T20:31:36.8334430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:36.8335207Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:36.8335604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:36.8335832Z
2025-05-07T20:31:36.8336043Z self =
2025-05-07T20:31:36.8337118Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:36.8338483Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51ca943100>}
2025-05-07T20:31:36.8339816Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:36.8340843Z context =
2025-05-07T20:31:36.8341128Z
2025-05-07T20:31:36.8341301Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:36.8341820Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:36.8342286Z                            module_map=module_map)
2025-05-07T20:31:36.8342732Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:36.8343087Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:36.8343352Z E       ^
2025-05-07T20:31:36.8343817Z E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:36.8344262Z 2025-05-07T20:31:36.8344682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:36.8345190Z 2025-05-07T20:31:36.8345302Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:36.8345715Z self=, 2025-05-07T20:31:36.8346126Z T=4096, 2025-05-07T20:31:36.8346313Z D=5120, 2025-05-07T20:31:36.8346511Z scale_ub=1200.0, 2025-05-07T20:31:36.8346740Z contiguous=True, 2025-05-07T20:31:36.8346959Z compiled=False, 2025-05-07T20:31:36.8347176Z ) 2025-05-07T20:31:36.8347494Z self = 2025-05-07T20:31:36.8347993Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:36.8348266Z 2025-05-07T20:31:36.8348346Z @given( 2025-05-07T20:31:36.8348577Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:36.8348891Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:36.8349193Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:36.8349524Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:36.8349859Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:36.8350147Z ) 2025-05-07T20:31:36.8350499Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:36.8350942Z def test_silu_mul_quant( 2025-05-07T20:31:36.8351187Z self, 2025-05-07T20:31:36.8351380Z T: int, 2025-05-07T20:31:36.8351585Z D: int, 2025-05-07T20:31:36.8351810Z scale_ub: Optional[float], 2025-05-07T20:31:36.8352080Z contiguous: bool, 2025-05-07T20:31:36.8352331Z compiled: bool, 2025-05-07T20:31:36.8352561Z ) -> None: 2025-05-07T20:31:36.8352771Z torch.manual_seed(2025) 2025-05-07T20:31:36.8353017Z 2025-05-07T20:31:36.8353299Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:36.8353637Z 2025-05-07T20:31:36.8353837Z x_sign = torch.sign(x) 2025-05-07T20:31:36.8354130Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:36.8354549Z x = x_sign * x_clamp 2025-05-07T20:31:36.8354794Z x0 = x[:, :D] 2025-05-07T20:31:36.8355015Z x1 = x[:, D:] 2025-05-07T20:31:36.8355223Z 2025-05-07T20:31:36.8355420Z if contiguous: 2025-05-07T20:31:36.8355653Z x0 = x0.contiguous() 2025-05-07T20:31:36.8355907Z x1 = x1.contiguous() 2025-05-07T20:31:36.8356148Z 2025-05-07T20:31:36.8356348Z if scale_ub is not None: 2025-05-07T20:31:36.8356614Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:36.8356957Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:36.8357273Z ) 2025-05-07T20:31:36.8357472Z else: 2025-05-07T20:31:36.8357681Z scale_ub_tensor = None 2025-05-07T20:31:36.8357929Z 2025-05-07T20:31:36.8358159Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.8358466Z op = silu_mul_quant 2025-05-07T20:31:36.8358718Z if compiled: 2025-05-07T20:31:36.8358972Z op = torch.compile(op) 2025-05-07T20:31:36.8359260Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.8359533Z 2025-05-07T20:31:36.8359722Z > y_fp8, y_scale = fn() 2025-05-07T20:31:36.8359883Z 2025-05-07T20:31:36.8359983Z moe/activation_test.py:117: 2025-05-07T20:31:36.8360274Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.8360601Z moe/activation_test.py:115: in fn 2025-05-07T20:31:36.8360882Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.8361648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:36.8362341Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:36.8362873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:36.8363550Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:36.8364215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:36.8364745Z kernel = self.compile( 2025-05-07T20:31:36.8365280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:36.8365925Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:36.8366325Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.8366552Z 2025-05-07T20:31:36.8366762Z self = 2025-05-07T20:31:36.8367881Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:36.8369291Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51cb66b560>} 2025-05-07T20:31:36.8370628Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:36.8371645Z context = 2025-05-07T20:31:36.8371930Z 2025-05-07T20:31:36.8372104Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:36.8372619Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:36.8373082Z module_map=module_map) 2025-05-07T20:31:36.8373447Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:36.8373803Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:36.8374059Z E ^ 2025-05-07T20:31:36.8374613Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:36.8375066Z 2025-05-07T20:31:36.8375481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:36.8375987Z 2025-05-07T20:31:36.8376098Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:36.8376506Z self=, 2025-05-07T20:31:36.8376914Z T=1, 2025-05-07T20:31:36.8377107Z D=5120, 2025-05-07T20:31:36.8377296Z scale_ub=None, 2025-05-07T20:31:36.8377508Z contiguous=True, 2025-05-07T20:31:36.8377732Z compiled=True, 2025-05-07T20:31:36.8377929Z ) 2025-05-07T20:31:37.1786711Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:37.1787926Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:31:37.1789279Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:37.1790881Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:37.1791867Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:37.1793168Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:37.1794545Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:37.1795528Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:37.1796765Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:37.1798134Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:37.1799191Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:37.1800467Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:37.1801711Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:31:37.1802921Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:37.1804117Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:31:37.1805070Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:37.1806369Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:37.1807381Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:31:37.1808246Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^ 2025-05-07T20:31:37.1809448Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:37.1810727Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:37.1811840Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:37.1812882Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:31:37.1814179Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:37.1815533Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:37.1816588Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:37.1817501Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:37.1818240Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:31:37.1819425Z W0507 20:31:37.176000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:37.7366949Z self = 2025-05-07T20:31:37.7367788Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:37.7368164Z 2025-05-07T20:31:37.7368285Z @given( 2025-05-07T20:31:37.7368595Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:37.7369046Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:37.7369375Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:37.7369714Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:37.7370040Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:37.7370334Z ) 2025-05-07T20:31:37.7370789Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:37.7371256Z def test_silu_mul_quant( 2025-05-07T20:31:37.7371506Z self, 2025-05-07T20:31:37.7371710Z T: int, 2025-05-07T20:31:37.7371912Z D: int, 2025-05-07T20:31:37.7372138Z scale_ub: Optional[float], 2025-05-07T20:31:37.7372415Z contiguous: bool, 2025-05-07T20:31:37.7372655Z compiled: bool, 2025-05-07T20:31:37.7372887Z ) -> None: 2025-05-07T20:31:37.7373109Z torch.manual_seed(2025) 2025-05-07T20:31:37.7373353Z 2025-05-07T20:31:37.7373826Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:37.7374182Z 2025-05-07T20:31:37.7374390Z x_sign = torch.sign(x) 2025-05-07T20:31:37.7374681Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:37.7374996Z x = x_sign * x_clamp 2025-05-07T20:31:37.7375244Z x0 = x[:, :D] 2025-05-07T20:31:37.7375464Z x1 = x[:, D:] 2025-05-07T20:31:37.7375677Z 2025-05-07T20:31:37.7375869Z if contiguous: 2025-05-07T20:31:37.7376108Z x0 = x0.contiguous() 2025-05-07T20:31:37.7376374Z x1 = x1.contiguous() 2025-05-07T20:31:37.7376619Z 2025-05-07T20:31:37.7376815Z if scale_ub is not None: 2025-05-07T20:31:37.7377096Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:37.7377456Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:37.7377770Z ) 2025-05-07T20:31:37.7377964Z else: 2025-05-07T20:31:37.7378182Z scale_ub_tensor = None 2025-05-07T20:31:37.7378446Z 2025-05-07T20:31:37.7378681Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:37.7379003Z op = silu_mul_quant 2025-05-07T20:31:37.7379260Z if compiled: 2025-05-07T20:31:37.7379508Z op = torch.compile(op) 2025-05-07T20:31:37.7379819Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:37.7380101Z 2025-05-07T20:31:37.7380307Z y_fp8, y_scale = fn() 2025-05-07T20:31:37.7380596Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:37.7380899Z 2025-05-07T20:31:37.7381142Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:37.7381479Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:37.7381779Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:37.7382100Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:37.7382460Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:37.7382776Z 2025-05-07T20:31:37.7382991Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:37.7383185Z 2025-05-07T20:31:37.7383291Z moe/activation_test.py:126: 2025-05-07T20:31:37.7383586Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.7383922Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:37.7384249Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:37.7385036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 
2025-05-07T20:31:37.7385980Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:37.7386528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:37.7387207Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:37.7387895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:37.7388619Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:37.7389370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:37.7390110Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:37.7390838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:37.7391483Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:37.7392081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:37.7392591Z fn() 2025-05-07T20:31:37.7393096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:37.7393759Z self.fn.run( 2025-05-07T20:31:37.7394230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:37.7394756Z kernel = self.compile( 2025-05-07T20:31:37.7395297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:37.7395949Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:37.7396336Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:37.7396575Z 2025-05-07T20:31:37.7396782Z self = 2025-05-07T20:31:37.7397860Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:37.7399239Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51cbd077e0>} 2025-05-07T20:31:37.7400583Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:37.7401605Z context = 2025-05-07T20:31:37.7401904Z 2025-05-07T20:31:37.7402068Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:37.7402586Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:37.7403054Z module_map=module_map) 2025-05-07T20:31:37.7403416Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:37.7403775Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:37.7404044Z E ^ 2025-05-07T20:31:37.7404508Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:37.7404964Z 2025-05-07T20:31:37.7405376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:37.7406151Z 2025-05-07T20:31:37.7406259Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:37.7406675Z self=, 2025-05-07T20:31:37.7407215Z T=2048, 2025-05-07T20:31:37.7407407Z D=5120, 2025-05-07T20:31:37.7407657Z scale_ub=None, 2025-05-07T20:31:37.7407867Z contiguous=True, 2025-05-07T20:31:37.7408092Z compiled=True, 2025-05-07T20:31:37.7408300Z ) 2025-05-07T20:31:38.0540782Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:38.0542077Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:31:38.0543409Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:38.0544818Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:38.0545797Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:38.0547256Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:38.0548629Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:38.0549603Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:38.0550815Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:38.0552172Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:38.0553225Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:38.0554492Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:38.0555728Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:31:38.0556940Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:38.0558134Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:38.0558953Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:38.0559968Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:38.0560975Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:31:38.0561874Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^ 2025-05-07T20:31:38.0563072Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:38.0564363Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:38.0565466Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:38.0566489Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:31:38.0567800Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:38.0569203Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:38.0570390Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:38.0571295Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:38.0572023Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:31:38.0573034Z W0507 20:31:38.051000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:38.1385062Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:38.1386323Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:31:38.1387666Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:38.1389076Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:38.1390047Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:38.1391345Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:38.1392722Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:38.1393698Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:38.1394921Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:38.1396453Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:38.1397512Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:38.1398784Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:38.1400020Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:31:38.1401231Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:38.1402430Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:38.1403508Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:38.1404531Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:38.1405542Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:31:38.1406624Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^ 2025-05-07T20:31:38.1407883Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:38.1409157Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:38.1410272Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:38.1411304Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:31:38.1412469Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:38.1413824Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:38.1414879Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:38.1415789Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:38.1416524Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:31:38.1417532Z W0507 20:31:38.136000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[... the identify_mutated_tensors warning and CompilationError traceback above repeats twice more, verbatim except for timestamps (W0507 20:31:38.399000, W0507 20:31:38.413000); omitted ...]
2025-05-07T20:31:38.7756616Z self = 
2025-05-07T20:31:38.7757299Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:38.7757587Z 
2025-05-07T20:31:38.7757671Z     @given(
2025-05-07T20:31:38.7758129Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:38.7758455Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:38.7758767Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:38.7759104Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:38.7759444Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:38.7759732Z     )
2025-05-07T20:31:38.7760095Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:38.7760549Z     def test_silu_mul_quant(
2025-05-07T20:31:38.7767950Z         self,
2025-05-07T20:31:38.7768168Z         T: int,
2025-05-07T20:31:38.7768380Z         D: int,
2025-05-07T20:31:38.7768604Z         scale_ub: Optional[float],
2025-05-07T20:31:38.7768889Z         contiguous: bool,
2025-05-07T20:31:38.7769139Z         compiled: bool,
2025-05-07T20:31:38.7769368Z     ) -> None:
2025-05-07T20:31:38.7769590Z         torch.manual_seed(2025)
2025-05-07T20:31:38.7769855Z 
2025-05-07T20:31:38.7770131Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:38.7770488Z 
2025-05-07T20:31:38.7770693Z         x_sign = torch.sign(x)
2025-05-07T20:31:38.7770987Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:38.7771309Z         x = x_sign * x_clamp
2025-05-07T20:31:38.7771559Z         x0 = x[:, :D]
2025-05-07T20:31:38.7771784Z         x1 = x[:, D:]
2025-05-07T20:31:38.7772003Z 
2025-05-07T20:31:38.7772199Z         if contiguous:
2025-05-07T20:31:38.7772589Z             x0 = x0.contiguous()
2025-05-07T20:31:38.7772863Z             x1 = x1.contiguous()
2025-05-07T20:31:38.7773112Z 
2025-05-07T20:31:38.7773315Z         if scale_ub is not None:
2025-05-07T20:31:38.7773593Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:38.7773939Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:38.7774258Z             )
2025-05-07T20:31:38.7774460Z         else:
2025-05-07T20:31:38.7774693Z             scale_ub_tensor = None
2025-05-07T20:31:38.7774954Z 
2025-05-07T20:31:38.7775188Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:38.7775518Z             op = silu_mul_quant
2025-05-07T20:31:38.7775780Z             if compiled:
2025-05-07T20:31:38.7776032Z                 op = torch.compile(op)
2025-05-07T20:31:38.7776337Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:38.7776619Z 
2025-05-07T20:31:38.7776815Z         y_fp8, y_scale = fn()
2025-05-07T20:31:38.7777113Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:38.7777409Z 
2025-05-07T20:31:38.7777648Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:38.7777990Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:38.7778290Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:38.7778609Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:38.7778966Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:38.7779288Z 
2025-05-07T20:31:38.7779494Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:38.7779691Z 
2025-05-07T20:31:38.7779796Z moe/activation_test.py:126: 
2025-05-07T20:31:38.7780099Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:38.7780444Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:38.7780772Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:38.7781572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:38.7782335Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:38.7782891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:31:38.7783573Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:38.7784264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:38.7785083Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:38.7785839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 
2025-05-07T20:31:38.7786585Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:38.7787325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:38.7787977Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:38.7788588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:38.7789155Z     fn()
2025-05-07T20:31:38.7789676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:38.7790267Z     self.fn.run(
2025-05-07T20:31:38.7790735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:38.7791271Z     kernel = self.compile(
2025-05-07T20:31:38.7791818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:38.7792487Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:38.7792968Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:38.7793206Z 
2025-05-07T20:31:38.7793417Z self = 
2025-05-07T20:31:38.7794512Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:38.7796053Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51cbc36c00>}
2025-05-07T20:31:38.7797405Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:38.7798443Z context = 
2025-05-07T20:31:38.7798738Z 
2025-05-07T20:31:38.7798908Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:38.7799435Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:38.7799904Z                            module_map=module_map)
2025-05-07T20:31:38.7800276Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:38.7800641Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:38.7800918Z E       ^
2025-05-07T20:31:38.7801390Z E       ValueError("type fp8e4nv not supported in this architecture. 
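For context on what the failing reference path computes: ref_fn applies SiLU (x * sigmoid(x)) to x0, multiplies by x1, and then quantizes row-wise to FP8. A plain-PyTorch approximation of that quantization step is sketched below; it is inferred from how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]) and is not FBGEMM's exact _kernel_quantize_fp8_row semantics.

```python
# Hedged sketch of row-wise FP8 quantization; the real Triton kernel in
# fp8_gemm.py differs in tiling, autotuning, and edge-case handling.
from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    row_max = y.abs().amax(dim=-1).clamp(min=1e-12)  # per-row max magnitude
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)   # cap the scale if given
    scale = row_max / FP8_MAX
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale  # dequantize via y_fp8.float() * scale[:, None]
```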
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:38.7801851Z 
2025-05-07T20:31:38.7802272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:38.7802787Z 
2025-05-07T20:31:38.7802901Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:38.7803320Z     self=,
2025-05-07T20:31:38.7803731Z     T=128,
2025-05-07T20:31:38.7803930Z     D=5120,
2025-05-07T20:31:38.7804122Z     scale_ub=None,
2025-05-07T20:31:38.7804348Z     contiguous=True,
2025-05-07T20:31:38.7804576Z     compiled=True,
2025-05-07T20:31:38.7804786Z )
[... the identify_mutated_tensors warning and CompilationError traceback above repeats four times for this example, verbatim except for timestamps (W0507 20:31:39.097000, 20:31:39.182000, 20:31:39.448000, 20:31:39.463000); omitted ...]
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:39.1015073Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:31:39.1015897Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.1016914Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:39.1017927Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:31:39.1018710Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^ 2025-05-07T20:31:39.1019912Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:39.1021309Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:39.1022417Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:39.1023455Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:31:39.1024618Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:39.1025966Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:39.1027025Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.1027928Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.1028655Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:31:39.1029746Z W0507 20:31:39.097000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.1849576Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:39.1850817Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:31:39.1852157Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:39.1853565Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:39.1854537Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.1855836Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:39.1857215Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.1858195Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.1859474Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:39.1860836Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.1861891Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.1863330Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:39.1864562Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:31:39.1865778Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:39.1866964Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:31:39.1867783Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.1868801Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:39.1869805Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:31:39.1870697Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^ 2025-05-07T20:31:39.1871890Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:39.1873159Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:39.1874266Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:39.1875293Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:31:39.1876457Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:39.1877794Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:39.1878842Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.1879747Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.1880477Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:31:39.1881484Z W0507 20:31:39.182000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.4512302Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:39.4513651Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:31:39.4514983Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:39.4516576Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:39.4517550Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.4518841Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:39.4520215Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.4521192Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.4522405Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:39.4523876Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.4524923Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.4526197Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:39.4527444Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:31:39.4528729Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:39.4529937Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:31:39.4530754Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.4531773Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:39.4532788Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:31:39.4533575Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^ 2025-05-07T20:31:39.4534781Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:39.4536047Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:39.4537157Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:39.4538303Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:31:39.4539473Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:39.4540874Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:39.4541923Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.4542825Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.4543563Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:31:39.4544582Z W0507 20:31:39.448000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.4655087Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:39.4656342Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:31:39.4657680Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:39.4659124Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:39.4660125Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.4661435Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:39.4662805Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.4663795Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.4665025Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:39.4666398Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.4667457Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.4668733Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:39.4670153Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:31:39.4671375Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:39.4672589Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:31:39.4673415Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:39.4674431Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:39.4675456Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:31:39.4676250Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^ 2025-05-07T20:31:39.4677532Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:39.4678811Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:39.4679931Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:39.4680973Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:31:39.4682158Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:39.4683515Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:39.4684573Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.4685482Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.4686222Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:31:39.4687247Z W0507 20:31:39.463000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:39.6935601Z self = 
2025-05-07T20:31:39.6936359Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... identical test source listing and CompilationError traceback as for the T = 2048 example above; omitted ...]
2025-05-07T20:31:39.6973703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
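The same failure can be reproduced outside the Hypothesis loop by driving the reference path directly. The import path and call signature below are taken from the traceback above; the surrounding scaffolding is an illustrative assumption.

```python
# Standalone repro sketch for the T=128 example; on an sm_86 GPU this
# raises the same CompilationError from inside triton_quantize_fp8_row.
import torch
from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import (
    triton_quantize_fp8_row,
)

T, D = 128, 5120
torch.manual_seed(2025)
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()

# Same math as ref_fn in moe/activation_test.py.
y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
y_fp8, y_scale = triton_quantize_fp8_row(y, None)  # CompilationError on pre-sm_89
```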
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.6973285Z 2025-05-07T20:31:39.6973703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:39.6974214Z 2025-05-07T20:31:39.6974318Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:39.6974736Z self=, 2025-05-07T20:31:39.6975143Z T=4096, 2025-05-07T20:31:39.6975332Z D=5120, 2025-05-07T20:31:39.6975523Z scale_ub=None, 2025-05-07T20:31:39.6975739Z contiguous=True, 2025-05-07T20:31:39.6975963Z compiled=True, 2025-05-07T20:31:39.6976163Z ) 2025-05-07T20:31:40.0199993Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:40.0201545Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:40.0203123Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:40.0204751Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:40.0206005Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:40.0207311Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:40.0208746Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:40.0209719Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:40.0210936Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:40.0212420Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:40.0213482Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:40.0214750Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:40.0215993Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:40.0217194Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:40.0218396Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:40.0219211Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:40.0220223Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:40.0221232Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:31:40.0222008Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:31:40.0223209Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:40.0224477Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:40.0225581Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:40.0226605Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:31:40.0227889Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:40.0229235Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:40.0230332Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:40.0231232Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:40.0231956Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:31:40.0232971Z W0507 20:31:40.017000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:40.1058086Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:40.1059551Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:40.1060884Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:40.1062301Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:40.1063282Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:40.1064587Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:40.1065956Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:40.1066929Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:40.1068149Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:40.1069515Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:40.1070575Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:40.1071843Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:40.1073072Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:40.1074401Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:40.1075599Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:40.1076428Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:40.1077447Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:40.1078454Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:31:40.1079245Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:31:40.1094101Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:40.1095594Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:40.1096758Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:40.1097843Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:31:40.1099085Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:40.1100469Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:40.1101555Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:40.1102479Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:40.1103232Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:31:40.1104259Z W0507 20:31:40.103000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:40.3711257Z W0507 20:31:40.368000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:40.3853004Z W0507 20:31:40.383000 86821 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[each of these two warnings was followed by a verbatim repeat of the traceback above, ending in the same CompilationError for _fbgemm_silu_mul_quant; repeated stack traces omitted]
2025-05-07T20:31:40.6222320Z self = 
2025-05-07T20:31:40.6223768Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:40.6224545Z 
2025-05-07T20:31:40.6224767Z     @given(
2025-05-07T20:31:40.6225251Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:40.6225882Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:40.6226498Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:40.6227171Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:40.6227822Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:40.6228404Z     )
2025-05-07T20:31:40.6229126Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:40.6229671Z     def test_silu_mul_quant(
2025-05-07T20:31:40.6229927Z         self,
2025-05-07T20:31:40.6230140Z         T: int,
2025-05-07T20:31:40.6230339Z         D: int,
2025-05-07T20:31:40.6230574Z         scale_ub: Optional[float],
2025-05-07T20:31:40.6230856Z         contiguous: bool,
2025-05-07T20:31:40.6231105Z         compiled: bool,
2025-05-07T20:31:40.6231333Z     ) -> None:
2025-05-07T20:31:40.6231762Z         torch.manual_seed(2025)
2025-05-07T20:31:40.6232012Z 
2025-05-07T20:31:40.6232291Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:40.6232646Z 
2025-05-07T20:31:40.6232853Z         x_sign = torch.sign(x)
2025-05-07T20:31:40.6233146Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:40.6233466Z         x = x_sign * x_clamp
2025-05-07T20:31:40.6233717Z         x0 = x[:, :D]
2025-05-07T20:31:40.6233936Z         x1 = x[:, D:]
2025-05-07T20:31:40.6234163Z 
2025-05-07T20:31:40.6234364Z         if contiguous:
2025-05-07T20:31:40.6234603Z             x0 = x0.contiguous()
2025-05-07T20:31:40.6234880Z             x1 = x1.contiguous()
2025-05-07T20:31:40.6235131Z 
2025-05-07T20:31:40.6235329Z         if scale_ub is not None:
2025-05-07T20:31:40.6235614Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:40.6235972Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:40.6236298Z             )
2025-05-07T20:31:40.6236514Z         else:
2025-05-07T20:31:40.6236726Z             scale_ub_tensor = None
2025-05-07T20:31:40.6236984Z 
2025-05-07T20:31:40.6237225Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:40.6237541Z             op = silu_mul_quant
2025-05-07T20:31:40.6237798Z             if compiled:
2025-05-07T20:31:40.6238050Z                 op = torch.compile(op)
2025-05-07T20:31:40.6238350Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:40.6238632Z 
2025-05-07T20:31:40.6238954Z         y_fp8, y_scale = fn()
2025-05-07T20:31:40.6239244Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:40.6239543Z 
2025-05-07T20:31:40.6239808Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:40.6240174Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:40.6240483Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:40.6240805Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:40.6241167Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:40.6241481Z 
2025-05-07T20:31:40.6241690Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:40.6241888Z 
2025-05-07T20:31:40.6241999Z moe/activation_test.py:126: 
2025-05-07T20:31:40.6242301Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:40.6242645Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:40.6242990Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:40.6243778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:40.6244543Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:40.6245099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:40.6245788Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:40.6246476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:40.6247205Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:40.6248050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:40.6248805Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:40.6249533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:40.6250177Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:40.6250782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:40.6251306Z     fn()
2025-05-07T20:31:40.6251812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:40.6252485Z     self.fn.run(
2025-05-07T20:31:40.6252958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:40.6253488Z     kernel = self.compile(
2025-05-07T20:31:40.6254033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:40.6254698Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:40.6255102Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:40.6255332Z 
2025-05-07T20:31:40.6255538Z self = 
2025-05-07T20:31:40.6256620Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:40.6258004Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c9bea5c0>}
2025-05-07T20:31:40.6259345Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:40.6260441Z context = 
2025-05-07T20:31:40.6260739Z 
2025-05-07T20:31:40.6260909Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:40.6261440Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:40.6261910Z                            module_map=module_map)
2025-05-07T20:31:40.6262275Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:40.6262646Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:40.6262923Z E       ^
2025-05-07T20:31:40.6263385Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:40.6263841Z 
2025-05-07T20:31:40.6264257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:40.6264769Z 
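[editor's note: for readers tracing the failing reference path, the sketch below re-states in plain PyTorch what the row-wise fp8 quantization in triton_quantize_fp8_row computes, under stated assumptions: 448.0 is the largest finite float8_e4m3fn value, scale_ub (when given) caps the per-row maximum, and the returned scale is the reciprocal used for dequantization, matching y_fp8.to(torch.float32) * y_scale[:, None] in the test. This illustrates the semantics only; it is not FBGEMM's implementation:]

import torch

E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_fp8_row_sketch(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
    row_max = y.abs().amax(dim=1).float()            # per-row absolute maximum
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)   # cap the dynamic range
    scale = E4M3_MAX / row_max.clamp(min=1e-12)      # per-row quantization scale
    y_fp8 = (y.float() * scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale.reciprocal()                 # dequantize: y_fp8 * recip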
2025-05-07T20:31:40.6264883Z Trying example: test_silu_mul_quant( self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True, )
2025-05-07T20:31:40.6531833Z W0507 20:31:40.652000 86821 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:31:40.6534751Z W0507 20:31:40.652000 86821 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:31:40.6537398Z W0507 20:31:40.652000 86821 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:31:40.6539377Z W0507 20:31:40.652000 86821 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:31:40.6540740Z W0507 20:31:40.652000 86821 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
[this example then failed exactly as the one above — in ref_fn() at moe/activation_test.py:126, via triton_quantize_fp8_row and _kernel_quantize_fp8_row, with the identical test listing, traceback, and fp8e4nv CompilationError; repeated output omitted]
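[editor's note: the recompile_limit warning above is a secondary symptom: every hypothesis example hands torch.compile tensors with new shapes or strides, and after 8 guard misses dynamo stops recompiling. Assuming the shape/stride variation is the only trigger, one hedged fix in a test like this is to compile once with dynamic shapes; the import path is inferred from the traceback and may differ:]

import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant  # path inferred from the log

# Compile a single shape/stride-agnostic graph instead of specializing
# per (T, D, stride) combination, which is what exhausts the limit here.
op = torch.compile(silu_mul_quant, dynamic=True)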
2025-05-07T20:31:40.7233400Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True, )
[failed in fn() at moe/activation_test.py:117, via silu_mul_quant (activation.py:80) and _fbgemm_silu_mul_quant; same fp8e4nv CompilationError, repeated output omitted]
2025-05-07T20:31:41.0215797Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True, )
[failed in ref_fn() at moe/activation_test.py:126, via triton_quantize_fp8_row and _kernel_quantize_fp8_row; repeated output omitted]
2025-05-07T20:31:41.0741149Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False, )
[failed in fn() via _fbgemm_silu_mul_quant; repeated output omitted]
2025-05-07T20:31:41.1924675Z Trying example: test_silu_mul_quant( self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True, )
[failed in fn() via _fbgemm_silu_mul_quant; repeated output omitted]
2025-05-07T20:31:41.1956460Z Trying example: test_silu_mul_quant( self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False, )
[failed in fn() via _fbgemm_silu_mul_quant; repeated output omitted]
2025-05-07T20:31:41.2900670Z Trying example: test_silu_mul_quant( self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False, )
[failed in fn() via _fbgemm_silu_mul_quant; its traceback ends:]
2025-05-07T20:31:41.2898085Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:41.2898443Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:41.2898711Z E       ^
2025-05-07T20:31:41.2899179Z E       ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
[test body and CompilationError traceback identical to the example above; repeated block elided]
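Every Hypothesis example fails at the same point: the kernels ask Triton for the fp8e4nv type (torch.float8_e4m3fn), which Triton only provides on GPUs with compute capability 8.9 or newer (Ada/Hopper). The linux.g5.4xlarge runner carries an A10G, which reports sm_86, where only fp8e4b15 and fp8e5 exist, hence the repeated CompilationError. A minimal sketch of a capability guard a test harness could use to skip these cases on unsupported hardware (the helper names here are illustrative, not part of FBGEMM's API):

    import pytest
    import torch

    def fp8e4nv_supported() -> bool:
        # fp8e4nv corresponds to torch.float8_e4m3fn; Triton emits it only
        # for compute capability >= (8, 9). The A10G in a g5.4xlarge
        # reports (8, 6), so this returns False on this runner.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    requires_fp8e4nv = pytest.mark.skipif(
        not fp8e4nv_supported(),
        reason="fp8e4nv requires sm_89+ (Ada/Hopper)",
    )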
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
[same CompilationError: "type fp8e4nv not supported in this architecture"; with compiled=True the traceback additionally passes through torch/_dynamo/eval_frame.py:678 in _fn before reaching activation.py:80]

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[same CompilationError; repeated block elided]

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = 
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

[test body as above; this time fn() itself succeeds and the failure moves to the reference path]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref,
y_scale_ref = ref_fn() 2025-05-07T20:31:41.7916784Z 2025-05-07T20:31:41.7916883Z moe/activation_test.py:126: 2025-05-07T20:31:41.7917180Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.7917507Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:41.7917831Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:41.7918612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:41.7919361Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:41.7919895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.7920576Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.7921259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:41.7921976Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.7922838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:41.7923623Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:41.7924639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:41.7925363Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:41.7925962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:41.7926484Z fn() 2025-05-07T20:31:41.7926990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:41.7935186Z self.fn.run( 2025-05-07T20:31:41.7935710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.7936266Z kernel = self.compile( 2025-05-07T20:31:41.7936823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.7937480Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.7937887Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.7938127Z 2025-05-07T20:31:41.7938449Z self = 2025-05-07T20:31:41.7939543Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.7941788Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f51c8d36c00>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[test body and CompilationError traceback identical to the earlier compiled=True examples; repeated block elided]
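The reference path fails the same way because _kernel_quantize_fp8_row, reached via triton_quantize_fp8_row in fp8_gemm.py:2370, also asks Triton for fp8e4nv. A rough pure-PyTorch sketch of the row-wise quantization the reference performs, usable as a fallback on pre-sm_89 GPUs (a hypothetical helper that approximates FBGEMM's kernel rather than reproducing it):

    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_torch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row scale: row_max / FP8_MAX, optionally clamped by scale_ub,
        # with a floor to avoid dividing by zero on all-zero rows.
        row_max = y.abs().amax(dim=-1, keepdim=True).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        # Clamp before the cast so out-of-range values saturate explicitly.
        y_scaled = torch.clamp(y.float() / scale, min=-FP8_MAX, max=FP8_MAX)
        return y_scaled.to(torch.float8_e4m3fn), scale.squeeze(-1)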
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
[same CompilationError in _fbgemm_silu_mul_quant; repeated block elided]

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[same CompilationError; repeated block elided]

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[same CompilationError; repeated block elided]
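The failure is independent of T, D, scale_ub, contiguity, and compilation mode, which a standalone call confirms. A minimal repro sketch outside the Hypothesis harness (the script and error handling are illustrative; silu_mul_quant, its module path, and its argument order are taken from the traceback above):

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    D = 5120
    x = torch.randn([128, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D], x[:, D:]
    try:
        y_fp8, y_scale = silu_mul_quant(x0, x1, None)  # scale_ub=None
        print("ok:", y_fp8.shape, y_scale.shape)
    except Exception as e:  # CompilationError on pre-sm_89 GPUs such as the A10G
        print("failed on", torch.cuda.get_device_name(), "->", e)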
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[same CompilationError; repeated block elided]

Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
[test body and traceback identical to the example above, ending in the same error]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:42.2848204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
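Every one of these failures happens before the kernel ever runs: Triton rejects the fp8e4nv dtype while lowering _fbgemm_silu_mul_quant. The job ran on a linux.g5.4xlarge runner, whose NVIDIA A10G reports CUDA compute capability 8.6 (sm_86); this Triton build only emits fp8e4nv (float8_e4m3fn) conversions on sm_89 and newer, which is why the error offers only 'fp8e4b15' and 'fp8e5' as alternatives. Below is a minimal sketch of a capability guard that would skip rather than fail these examples on such runners (supports_fp8e4nv is a hypothetical helper, not part of the FBGEMM test suite):

import torch

def supports_fp8e4nv() -> bool:
    # Hypothetical guard: Triton's fp8e4nv conversions need SM 8.9+
    # (Ada/Hopper). The A10G on linux.g5.4xlarge reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Usage sketch: decorate the Hypothesis test so unsupported GPUs skip it.
# @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
# def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled): ...

With a guard along these lines, sm_86 runners would report the test as skipped instead of burning through every Hypothesis example to the same CompilationError.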
Each of the remaining Hypothesis examples reproduces the same test body, the same traceback through fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (silu_mul_quant calling _fbgemm_silu_mul_quant[grid]) and triton/runtime/jit.py into triton/compiler/compiler.py, and the identical CompilationError; runs with compiled=True only add a torch/_dynamo/eval_frame.py:678 frame before the same failure. The examples tried:

2025-05-07T20:31:42.2848819Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:31:42.4226636Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:42.5396471Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:42.5435316Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:42.6335761Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:42.6369762Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:31:42.6400729Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:43.0240673Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:43.0278731Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:43.0982410Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:43.2273976Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)

The last of these ends, like every run before it, with:

2025-05-07T20:31:43.2303125Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:43.2303477Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:31:43.2303734Z E ^
2025-05-07T20:31:43.2304200Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.2304653Z 2025-05-07T20:31:43.2305068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.2305573Z 2025-05-07T20:31:43.2305850Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.2306261Z self=, 2025-05-07T20:31:43.2306664Z T=16384, 2025-05-07T20:31:43.2306864Z D=5120, 2025-05-07T20:31:43.2307054Z scale_ub=1200.0, 2025-05-07T20:31:43.2307275Z contiguous=True, 2025-05-07T20:31:43.2307496Z compiled=True, 2025-05-07T20:31:43.2307695Z ) 2025-05-07T20:31:43.2308012Z self = 2025-05-07T20:31:43.2308511Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:43.2308784Z 2025-05-07T20:31:43.2308867Z @given( 2025-05-07T20:31:43.2309218Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.2309536Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.2309843Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.2310165Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.2310492Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.2310782Z ) 2025-05-07T20:31:43.2311126Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.2311571Z def test_silu_mul_quant( 2025-05-07T20:31:43.2311814Z self, 2025-05-07T20:31:43.2312007Z T: int, 2025-05-07T20:31:43.2312207Z D: int, 2025-05-07T20:31:43.2312427Z scale_ub: Optional[float], 2025-05-07T20:31:43.2312698Z contiguous: bool, 2025-05-07T20:31:43.2312934Z compiled: bool, 2025-05-07T20:31:43.2313157Z ) -> None: 2025-05-07T20:31:43.2313375Z torch.manual_seed(2025) 2025-05-07T20:31:43.2313613Z 2025-05-07T20:31:43.2313891Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.2314238Z 2025-05-07T20:31:43.2314428Z x_sign = torch.sign(x) 2025-05-07T20:31:43.2314716Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.2315034Z x = x_sign * x_clamp 2025-05-07T20:31:43.2315270Z x0 = x[:, :D] 2025-05-07T20:31:43.2315485Z x1 = x[:, D:] 2025-05-07T20:31:43.2315696Z 2025-05-07T20:31:43.2315878Z if contiguous: 2025-05-07T20:31:43.2316113Z x0 = x0.contiguous() 2025-05-07T20:31:43.2316369Z x1 = x1.contiguous() 2025-05-07T20:31:43.2316603Z 2025-05-07T20:31:43.2316797Z if scale_ub is not None: 2025-05-07T20:31:43.2317074Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.2317409Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.2317722Z ) 2025-05-07T20:31:43.2317923Z else: 2025-05-07T20:31:43.2318145Z scale_ub_tensor = None 2025-05-07T20:31:43.2318397Z 2025-05-07T20:31:43.2318632Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.2318951Z op = silu_mul_quant 2025-05-07T20:31:43.2319202Z if compiled: 2025-05-07T20:31:43.2319456Z op = torch.compile(op) 2025-05-07T20:31:43.2319757Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.2320032Z 2025-05-07T20:31:43.2320235Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.2320530Z 2025-05-07T20:31:43.2320637Z moe/activation_test.py:117: 2025-05-07T20:31:43.2320932Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.2321269Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.2321555Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.2322121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.2322678Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.2323346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.2324048Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.2324588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.2325419Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.2326096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.2326640Z kernel = self.compile( 2025-05-07T20:31:43.2334088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.2334799Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.2335209Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.2335557Z 2025-05-07T20:31:43.2335771Z self = 2025-05-07T20:31:43.2336856Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.2338232Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c81d2a20>} 2025-05-07T20:31:43.2339585Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.2340606Z context = 2025-05-07T20:31:43.2340901Z 2025-05-07T20:31:43.2341078Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.2341601Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.2342070Z module_map=module_map) 2025-05-07T20:31:43.2342438Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.2342797Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.2343062Z E ^ 2025-05-07T20:31:43.2343526Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.2343985Z 2025-05-07T20:31:43.2344404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.2344920Z 2025-05-07T20:31:43.5616936Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.5617527Z self=, 2025-05-07T20:31:43.5618047Z T=16384, 2025-05-07T20:31:43.5618286Z D=5120, 2025-05-07T20:31:43.5618500Z scale_ub=None, 2025-05-07T20:31:43.5618730Z contiguous=False, 2025-05-07T20:31:43.5618972Z compiled=True, 2025-05-07T20:31:43.5619198Z ) 2025-05-07T20:31:43.5619525Z self = 2025-05-07T20:31:43.5620039Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:43.5620322Z 2025-05-07T20:31:43.5620726Z @given( 2025-05-07T20:31:43.5620969Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.5621286Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.5621599Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.5621937Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.5622268Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.5622572Z ) 2025-05-07T20:31:43.5622936Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.5623379Z def test_silu_mul_quant( 2025-05-07T20:31:43.5623632Z self, 2025-05-07T20:31:43.5623839Z T: int, 2025-05-07T20:31:43.5624038Z D: int, 2025-05-07T20:31:43.5624266Z scale_ub: Optional[float], 2025-05-07T20:31:43.5624549Z contiguous: bool, 2025-05-07T20:31:43.5624794Z compiled: bool, 2025-05-07T20:31:43.5625035Z ) -> None: 2025-05-07T20:31:43.5625259Z torch.manual_seed(2025) 2025-05-07T20:31:43.5625509Z 2025-05-07T20:31:43.5625792Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.5626139Z 2025-05-07T20:31:43.5626343Z x_sign = torch.sign(x) 2025-05-07T20:31:43.5626636Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.5626955Z x = x_sign * x_clamp 2025-05-07T20:31:43.5627205Z x0 = x[:, :D] 2025-05-07T20:31:43.5627425Z x1 = x[:, D:] 2025-05-07T20:31:43.5627651Z 2025-05-07T20:31:43.5627987Z if contiguous: 2025-05-07T20:31:43.5628226Z x0 = x0.contiguous() 2025-05-07T20:31:43.5628499Z x1 = x1.contiguous() 2025-05-07T20:31:43.5628753Z 2025-05-07T20:31:43.5628952Z if scale_ub is not None: 2025-05-07T20:31:43.5629248Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.5629600Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.5629916Z ) 2025-05-07T20:31:43.5630125Z else: 2025-05-07T20:31:43.5630400Z scale_ub_tensor = None 2025-05-07T20:31:43.5630661Z 2025-05-07T20:31:43.5630908Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.5631233Z op = silu_mul_quant 2025-05-07T20:31:43.5631494Z if compiled: 2025-05-07T20:31:43.5631747Z op = torch.compile(op) 2025-05-07T20:31:43.5632051Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.5632336Z 2025-05-07T20:31:43.5632535Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.5632715Z 2025-05-07T20:31:43.5632818Z moe/activation_test.py:117: 2025-05-07T20:31:43.5633127Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.5633469Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.5633777Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.5634353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.5634934Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.5635600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.5636300Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.5636845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.5637537Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.5638203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.5638744Z kernel = self.compile( 2025-05-07T20:31:43.5639295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.5639956Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.5640364Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.5640729Z 2025-05-07T20:31:43.5640941Z self = 2025-05-07T20:31:43.5642032Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.5643443Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c81d3c40>} 2025-05-07T20:31:43.5644794Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.5645828Z context = 2025-05-07T20:31:43.5646132Z 2025-05-07T20:31:43.5646301Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.5646835Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.5647307Z module_map=module_map) 2025-05-07T20:31:43.5647781Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.5648150Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.5648417Z E ^ 2025-05-07T20:31:43.5648977Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.5649432Z 2025-05-07T20:31:43.5649856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.5650370Z 2025-05-07T20:31:43.5650475Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.5650915Z self=, 2025-05-07T20:31:43.5651361Z T=2048, 2025-05-07T20:31:43.5651555Z D=5120, 2025-05-07T20:31:43.5651760Z scale_ub=None, 2025-05-07T20:31:43.5651983Z contiguous=False, 2025-05-07T20:31:43.5652206Z compiled=True, 2025-05-07T20:31:43.5652415Z ) 2025-05-07T20:31:43.6364719Z self = 2025-05-07T20:31:43.6365263Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:43.6365583Z 2025-05-07T20:31:43.6365718Z @given( 2025-05-07T20:31:43.6366037Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.6366367Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.6366687Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.6367021Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.6367359Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.6367737Z ) 2025-05-07T20:31:43.6368091Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.6368551Z def test_silu_mul_quant( 2025-05-07T20:31:43.6368803Z self, 2025-05-07T20:31:43.6369003Z T: int, 2025-05-07T20:31:43.6369213Z D: int, 2025-05-07T20:31:43.6369440Z scale_ub: Optional[float], 2025-05-07T20:31:43.6369714Z contiguous: bool, 2025-05-07T20:31:43.6369965Z compiled: bool, 2025-05-07T20:31:43.6370200Z ) -> None: 2025-05-07T20:31:43.6370430Z torch.manual_seed(2025) 2025-05-07T20:31:43.6370680Z 2025-05-07T20:31:43.6370965Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.6371325Z 2025-05-07T20:31:43.6371524Z x_sign = torch.sign(x) 2025-05-07T20:31:43.6371828Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.6372147Z x = x_sign * x_clamp 2025-05-07T20:31:43.6372395Z x0 = x[:, :D] 2025-05-07T20:31:43.6372622Z x1 = x[:, D:] 2025-05-07T20:31:43.6373020Z 2025-05-07T20:31:43.6373212Z if contiguous: 2025-05-07T20:31:43.6373463Z x0 = x0.contiguous() 2025-05-07T20:31:43.6373735Z x1 = x1.contiguous() 2025-05-07T20:31:43.6373981Z 2025-05-07T20:31:43.6374185Z if scale_ub is not None: 2025-05-07T20:31:43.6374468Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.6374807Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.6375128Z ) 2025-05-07T20:31:43.6375339Z else: 2025-05-07T20:31:43.6375554Z scale_ub_tensor = None 2025-05-07T20:31:43.6375813Z 2025-05-07T20:31:43.6376075Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.6376401Z op = silu_mul_quant 2025-05-07T20:31:43.6376661Z if compiled: 2025-05-07T20:31:43.6376910Z op = torch.compile(op) 2025-05-07T20:31:43.6377219Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.6377512Z 2025-05-07T20:31:43.6377710Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.6377886Z 2025-05-07T20:31:43.6377989Z moe/activation_test.py:117: 2025-05-07T20:31:43.6378297Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.6378637Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.6378927Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.6379499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.6380187Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.6380910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.6381623Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.6382166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.6382849Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.6383524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.6384069Z kernel = self.compile( 2025-05-07T20:31:43.6384622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.6385285Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.6385699Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.6385932Z 2025-05-07T20:31:43.6386148Z self = 2025-05-07T20:31:43.6387247Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.6388632Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bbbe47c0>} 2025-05-07T20:31:43.6389975Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.6391004Z context = 2025-05-07T20:31:43.6391297Z 2025-05-07T20:31:43.6391469Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.6391991Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.6392459Z module_map=module_map) 2025-05-07T20:31:43.6392827Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.6393189Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.6393591Z E ^ 2025-05-07T20:31:43.6394064Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.6394517Z 2025-05-07T20:31:43.6394940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.6395448Z 2025-05-07T20:31:43.6395553Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.6395977Z self=, 2025-05-07T20:31:43.6396387Z T=2048, 2025-05-07T20:31:43.6396579Z D=5120, 2025-05-07T20:31:43.6396780Z scale_ub=1200.0, 2025-05-07T20:31:43.6397010Z contiguous=False, 2025-05-07T20:31:43.6397244Z compiled=True, 2025-05-07T20:31:43.6397448Z ) 2025-05-07T20:31:43.6397771Z self = 2025-05-07T20:31:43.6398270Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:43.6398547Z 2025-05-07T20:31:43.6398626Z @given( 2025-05-07T20:31:43.6398861Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.6399179Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.6399481Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.6399812Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.6400149Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.6400434Z ) 2025-05-07T20:31:43.6400861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.6401304Z def test_silu_mul_quant( 2025-05-07T20:31:43.6401547Z self, 2025-05-07T20:31:43.6401740Z T: int, 2025-05-07T20:31:43.6401942Z D: int, 2025-05-07T20:31:43.6402161Z scale_ub: Optional[float], 2025-05-07T20:31:43.6402428Z contiguous: bool, 2025-05-07T20:31:43.6402670Z compiled: bool, 2025-05-07T20:31:43.6402895Z ) -> None: 2025-05-07T20:31:43.6403115Z torch.manual_seed(2025) 2025-05-07T20:31:43.6403358Z 2025-05-07T20:31:43.6403631Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.6403968Z 2025-05-07T20:31:43.6404165Z x_sign = torch.sign(x) 2025-05-07T20:31:43.6404459Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.6404772Z x = x_sign * x_clamp 2025-05-07T20:31:43.6405013Z x0 = x[:, :D] 2025-05-07T20:31:43.6405237Z x1 = x[:, D:] 2025-05-07T20:31:43.6405451Z 2025-05-07T20:31:43.6405804Z if contiguous: 2025-05-07T20:31:43.6406039Z x0 = x0.contiguous() 2025-05-07T20:31:43.6406301Z x1 = x1.contiguous() 2025-05-07T20:31:43.6406541Z 2025-05-07T20:31:43.6406740Z if scale_ub is not None: 2025-05-07T20:31:43.6407017Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.6407350Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.6407728Z ) 2025-05-07T20:31:43.6407934Z else: 2025-05-07T20:31:43.6408152Z scale_ub_tensor = None 2025-05-07T20:31:43.6408414Z 2025-05-07T20:31:43.6408659Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.6408980Z op = silu_mul_quant 2025-05-07T20:31:43.6409243Z if compiled: 2025-05-07T20:31:43.6409504Z op = torch.compile(op) 2025-05-07T20:31:43.6409805Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.6410097Z 2025-05-07T20:31:43.6410302Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.6410468Z 2025-05-07T20:31:43.6410579Z moe/activation_test.py:117: 2025-05-07T20:31:43.6410922Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.6411280Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.6411581Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.6412139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.6412840Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.6413510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.6414210Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.6414746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.6415444Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.6416113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.6416647Z kernel = self.compile( 2025-05-07T20:31:43.6417198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.6417856Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.6418265Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.6418498Z 2025-05-07T20:31:43.6418707Z self = 2025-05-07T20:31:43.6419792Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.6421272Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bbbe58a0>} 2025-05-07T20:31:43.6422624Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.6423660Z context = 2025-05-07T20:31:43.6423955Z 2025-05-07T20:31:43.6424124Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.6424654Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.6425123Z module_map=module_map) 2025-05-07T20:31:43.6425482Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.6425839Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.6426109Z E ^ 2025-05-07T20:31:43.6426580Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.6427032Z 2025-05-07T20:31:43.6427446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.6427966Z 2025-05-07T20:31:43.7739931Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.7740410Z self=, 2025-05-07T20:31:43.7740866Z T=4096, 2025-05-07T20:31:43.7741062Z D=5120, 2025-05-07T20:31:43.7741274Z scale_ub=1200.0, 2025-05-07T20:31:43.7741498Z contiguous=True, 2025-05-07T20:31:43.7741730Z compiled=True, 2025-05-07T20:31:43.7741946Z ) 2025-05-07T20:31:43.7742265Z self = 2025-05-07T20:31:43.7742772Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:43.7743045Z 2025-05-07T20:31:43.7743136Z @given( 2025-05-07T20:31:43.7743371Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.7743692Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.7744003Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.7744338Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.7744664Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.7745159Z ) 2025-05-07T20:31:43.7745509Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.7745946Z def test_silu_mul_quant( 2025-05-07T20:31:43.7746198Z self, 2025-05-07T20:31:43.7746402Z T: int, 2025-05-07T20:31:43.7746603Z D: int, 2025-05-07T20:31:43.7746834Z scale_ub: Optional[float], 2025-05-07T20:31:43.7747110Z contiguous: bool, 2025-05-07T20:31:43.7747350Z compiled: bool, 2025-05-07T20:31:43.7747588Z ) -> None: 2025-05-07T20:31:43.7747812Z torch.manual_seed(2025) 2025-05-07T20:31:43.7748057Z 2025-05-07T20:31:43.7748339Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.7748690Z 2025-05-07T20:31:43.7748894Z x_sign = torch.sign(x) 2025-05-07T20:31:43.7749194Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.7749509Z x = x_sign * x_clamp 2025-05-07T20:31:43.7749758Z x0 = x[:, :D] 2025-05-07T20:31:43.7749981Z x1 = x[:, D:] 2025-05-07T20:31:43.7750198Z 2025-05-07T20:31:43.7750392Z if contiguous: 2025-05-07T20:31:43.7750624Z x0 = x0.contiguous() 2025-05-07T20:31:43.7750898Z x1 = x1.contiguous() 2025-05-07T20:31:43.7751145Z 2025-05-07T20:31:43.7751342Z if scale_ub is not None: 2025-05-07T20:31:43.7751620Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.7751960Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.7752397Z ) 2025-05-07T20:31:43.7752610Z else: 2025-05-07T20:31:43.7752831Z scale_ub_tensor = None 2025-05-07T20:31:43.7753084Z 2025-05-07T20:31:43.7753321Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.7753640Z op = silu_mul_quant 2025-05-07T20:31:43.7753889Z if compiled: 2025-05-07T20:31:43.7754139Z op = torch.compile(op) 2025-05-07T20:31:43.7754439Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.7754735Z 2025-05-07T20:31:43.7754928Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.7755102Z 2025-05-07T20:31:43.7755204Z moe/activation_test.py:117: 2025-05-07T20:31:43.7755505Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.7755843Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.7756124Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.7756693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.7757256Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.7757910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.7758600Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.7759138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.7759826Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.7760482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.7761059Z kernel = self.compile( 2025-05-07T20:31:43.7761606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.7762260Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.7762659Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.7762892Z 2025-05-07T20:31:43.7763099Z self = 2025-05-07T20:31:43.7764177Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.7765623Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bbbe6ac0>} 2025-05-07T20:31:43.7766960Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.7768062Z context = 2025-05-07T20:31:43.7768349Z 2025-05-07T20:31:43.7768522Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.7769048Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.7769514Z module_map=module_map) 2025-05-07T20:31:43.7769888Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.7770252Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.7770516Z E ^ 2025-05-07T20:31:43.7771016Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.7771490Z 2025-05-07T20:31:43.7771908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.7772418Z 2025-05-07T20:31:43.7772538Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.7773027Z self=, 2025-05-07T20:31:43.7773439Z T=128, 2025-05-07T20:31:43.7773640Z D=5120, 2025-05-07T20:31:43.7773843Z scale_ub=1200.0, 2025-05-07T20:31:43.7774074Z contiguous=False, 2025-05-07T20:31:43.7774306Z compiled=True, 2025-05-07T20:31:43.7774515Z ) 2025-05-07T20:31:43.8603679Z self = 2025-05-07T20:31:43.8604288Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:43.8604729Z 2025-05-07T20:31:43.8604816Z @given( 2025-05-07T20:31:43.8605068Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.8605386Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.8605983Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.8606325Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.8606667Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.8606970Z ) 2025-05-07T20:31:43.8607329Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.8616853Z def test_silu_mul_quant( 2025-05-07T20:31:43.8617124Z self, 2025-05-07T20:31:43.8617324Z T: int, 2025-05-07T20:31:43.8617529Z D: int, 2025-05-07T20:31:43.8617764Z scale_ub: Optional[float], 2025-05-07T20:31:43.8618035Z contiguous: bool, 2025-05-07T20:31:43.8618296Z compiled: bool, 2025-05-07T20:31:43.8618522Z ) -> None: 2025-05-07T20:31:43.8618750Z torch.manual_seed(2025) 2025-05-07T20:31:43.8619007Z 2025-05-07T20:31:43.8619285Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.8619641Z 2025-05-07T20:31:43.8619848Z x_sign = torch.sign(x) 2025-05-07T20:31:43.8620154Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.8620465Z x = x_sign * x_clamp 2025-05-07T20:31:43.8620719Z x0 = x[:, :D] 2025-05-07T20:31:43.8620936Z x1 = x[:, D:] 2025-05-07T20:31:43.8621145Z 2025-05-07T20:31:43.8621342Z if contiguous: 2025-05-07T20:31:43.8621583Z x0 = x0.contiguous() 2025-05-07T20:31:43.8621841Z x1 = x1.contiguous() 2025-05-07T20:31:43.8622091Z 2025-05-07T20:31:43.8622292Z if scale_ub is not None: 2025-05-07T20:31:43.8622567Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.8623094Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.8623413Z ) 2025-05-07T20:31:43.8623606Z else: 2025-05-07T20:31:43.8623822Z scale_ub_tensor = None 2025-05-07T20:31:43.8624075Z 2025-05-07T20:31:43.8624304Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.8624622Z op = silu_mul_quant 2025-05-07T20:31:43.8624879Z if compiled: 2025-05-07T20:31:43.8625130Z op = torch.compile(op) 2025-05-07T20:31:43.8625432Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.8625718Z 2025-05-07T20:31:43.8625917Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.8626085Z 2025-05-07T20:31:43.8626186Z moe/activation_test.py:117: 2025-05-07T20:31:43.8626481Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.8626827Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.8627103Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.8627672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.8628232Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.8628891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.8629577Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.8630229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.8630915Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.8631575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.8632107Z kernel = self.compile( 2025-05-07T20:31:43.8632651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.8633316Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.8633708Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.8633942Z 2025-05-07T20:31:43.8634151Z self = 2025-05-07T20:31:43.8635239Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.8636610Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb970540>} 2025-05-07T20:31:43.8637944Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.8638974Z context = 2025-05-07T20:31:43.8639267Z 2025-05-07T20:31:43.8639432Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.8639955Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.8640419Z module_map=module_map) 2025-05-07T20:31:43.8640790Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.8641150Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.8641415Z E ^ 2025-05-07T20:31:43.8641878Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.8642330Z 2025-05-07T20:31:43.8642745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.8643252Z 2025-05-07T20:31:43.8643449Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.8643866Z self=, 2025-05-07T20:31:43.8644264Z T=16384, 2025-05-07T20:31:43.8644467Z D=7168, 2025-05-07T20:31:43.8644666Z scale_ub=1200.0, 2025-05-07T20:31:43.8644888Z contiguous=True, 2025-05-07T20:31:43.8645115Z compiled=True, 2025-05-07T20:31:43.8645324Z ) 2025-05-07T20:31:43.8645639Z self = 2025-05-07T20:31:43.8646144Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:43.8646420Z 2025-05-07T20:31:43.8646507Z @given( 2025-05-07T20:31:43.8646736Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.8647054Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.8647364Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.8647763Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.8648092Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.8648384Z ) 2025-05-07T20:31:43.8648740Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.8649180Z def test_silu_mul_quant( 2025-05-07T20:31:43.8649424Z self, 2025-05-07T20:31:43.8649627Z T: int, 2025-05-07T20:31:43.8649822Z D: int, 2025-05-07T20:31:43.8650043Z scale_ub: Optional[float], 2025-05-07T20:31:43.8650320Z contiguous: bool, 2025-05-07T20:31:43.8650647Z compiled: bool, 2025-05-07T20:31:43.8650876Z ) -> None: 2025-05-07T20:31:43.8651093Z torch.manual_seed(2025) 2025-05-07T20:31:43.8651332Z 2025-05-07T20:31:43.8651605Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.8651952Z 2025-05-07T20:31:43.8652145Z x_sign = torch.sign(x) 2025-05-07T20:31:43.8652435Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.8652750Z x = x_sign * x_clamp 2025-05-07T20:31:43.8652993Z x0 = x[:, :D] 2025-05-07T20:31:43.8653206Z x1 = x[:, D:] 2025-05-07T20:31:43.8653418Z 2025-05-07T20:31:43.8653618Z if contiguous: 2025-05-07T20:31:43.8653845Z x0 = x0.contiguous() 2025-05-07T20:31:43.8654106Z x1 = x1.contiguous() 2025-05-07T20:31:43.8654354Z 2025-05-07T20:31:43.8654549Z if scale_ub is not None: 2025-05-07T20:31:43.8654832Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.8655180Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.8655490Z ) 2025-05-07T20:31:43.8655683Z else: 2025-05-07T20:31:43.8655896Z scale_ub_tensor = None 2025-05-07T20:31:43.8656150Z 2025-05-07T20:31:43.8656384Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.8656696Z op = silu_mul_quant 2025-05-07T20:31:43.8656946Z if compiled: 2025-05-07T20:31:43.8657200Z op = torch.compile(op) 2025-05-07T20:31:43.8657492Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.8657769Z 2025-05-07T20:31:43.8657962Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.8658124Z 2025-05-07T20:31:43.8658224Z moe/activation_test.py:117: 2025-05-07T20:31:43.8658523Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.8658856Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.8659134Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.8659694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.8660254Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.8660915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.8661596Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.8662131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.8662896Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.8663548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.8664076Z kernel = self.compile( 2025-05-07T20:31:43.8664616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.8665276Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.8665668Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.8665897Z 2025-05-07T20:31:43.8666102Z self = 2025-05-07T20:31:43.8667173Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.8668540Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb971080>} 2025-05-07T20:31:43.8669960Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.8670981Z context = 2025-05-07T20:31:43.8671271Z 2025-05-07T20:31:43.8671437Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.8671957Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.8672421Z module_map=module_map) 2025-05-07T20:31:43.8672784Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.8673138Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.8673398Z E ^ 2025-05-07T20:31:43.8673855Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.8674307Z 2025-05-07T20:31:43.8674720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.8675232Z 2025-05-07T20:31:43.9620565Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.9621117Z self=, 2025-05-07T20:31:43.9621697Z T=16384, 2025-05-07T20:31:43.9621965Z D=5120, 2025-05-07T20:31:43.9622239Z scale_ub=1200.0, 2025-05-07T20:31:43.9622502Z contiguous=True, 2025-05-07T20:31:43.9622724Z compiled=False, 2025-05-07T20:31:43.9622927Z ) 2025-05-07T20:31:43.9623249Z self = 2025-05-07T20:31:43.9623756Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:43.9624038Z 2025-05-07T20:31:43.9624114Z @given( 2025-05-07T20:31:43.9624348Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.9624656Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.9624963Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.9625295Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.9625628Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.9625909Z ) 2025-05-07T20:31:43.9626264Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.9626704Z def test_silu_mul_quant( 2025-05-07T20:31:43.9626944Z self, 2025-05-07T20:31:43.9627140Z T: int, 2025-05-07T20:31:43.9627341Z D: int, 2025-05-07T20:31:43.9627551Z scale_ub: Optional[float], 2025-05-07T20:31:43.9628002Z contiguous: bool, 2025-05-07T20:31:43.9628239Z compiled: bool, 2025-05-07T20:31:43.9628462Z ) -> None: 2025-05-07T20:31:43.9628683Z torch.manual_seed(2025) 2025-05-07T20:31:43.9628930Z 2025-05-07T20:31:43.9629199Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.9629545Z 2025-05-07T20:31:43.9629741Z x_sign = torch.sign(x) 2025-05-07T20:31:43.9630036Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.9630348Z x = x_sign * x_clamp 2025-05-07T20:31:43.9630634Z x0 = x[:, :D] 2025-05-07T20:31:43.9630857Z x1 = x[:, D:] 2025-05-07T20:31:43.9631064Z 2025-05-07T20:31:43.9631248Z if contiguous: 2025-05-07T20:31:43.9631487Z x0 = x0.contiguous() 2025-05-07T20:31:43.9631737Z x1 = x1.contiguous() 2025-05-07T20:31:43.9631981Z 2025-05-07T20:31:43.9632174Z if scale_ub is not None: 2025-05-07T20:31:43.9632450Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.9632784Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.9633099Z ) 2025-05-07T20:31:43.9633290Z else: 2025-05-07T20:31:43.9633505Z scale_ub_tensor = None 2025-05-07T20:31:43.9633756Z 2025-05-07T20:31:43.9633988Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.9634301Z op = silu_mul_quant 2025-05-07T20:31:43.9634546Z if compiled: 2025-05-07T20:31:43.9634936Z op = torch.compile(op) 2025-05-07T20:31:43.9635232Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.9635504Z 2025-05-07T20:31:43.9635698Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.9635862Z 2025-05-07T20:31:43.9635962Z moe/activation_test.py:117: 2025-05-07T20:31:43.9636264Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.9636599Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.9636886Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.9637574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:43.9638276Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.9638813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.9639506Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.9640185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.9640725Z kernel = self.compile( 2025-05-07T20:31:43.9641268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.9641922Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.9642317Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.9642555Z 2025-05-07T20:31:43.9642760Z self = 2025-05-07T20:31:43.9643843Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.9645216Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb972660>} 2025-05-07T20:31:43.9646554Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.9647682Z context = 2025-05-07T20:31:43.9648075Z 2025-05-07T20:31:43.9648239Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.9648768Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.9649224Z module_map=module_map) 2025-05-07T20:31:43.9649595Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.9649946Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.9650208Z E ^ 2025-05-07T20:31:43.9650676Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.9651132Z 2025-05-07T20:31:43.9651546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.9652054Z 2025-05-07T20:31:43.9652167Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.9652570Z self=, 2025-05-07T20:31:43.9652980Z T=1, 2025-05-07T20:31:43.9653167Z D=7168, 2025-05-07T20:31:43.9653357Z scale_ub=1200.0, 2025-05-07T20:31:43.9653587Z contiguous=False, 2025-05-07T20:31:43.9653809Z compiled=False, 2025-05-07T20:31:43.9654008Z ) 2025-05-07T20:31:43.9654328Z self = 2025-05-07T20:31:43.9654816Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:43.9655085Z 2025-05-07T20:31:43.9655249Z @given( 2025-05-07T20:31:43.9655483Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.9655791Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.9656104Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.9656426Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.9656758Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.9657041Z ) 2025-05-07T20:31:43.9657393Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.9657832Z def test_silu_mul_quant( 2025-05-07T20:31:43.9658082Z self, 2025-05-07T20:31:43.9658272Z T: int, 2025-05-07T20:31:43.9658467Z D: int, 2025-05-07T20:31:43.9658684Z scale_ub: Optional[float], 2025-05-07T20:31:43.9658956Z contiguous: bool, 2025-05-07T20:31:43.9659196Z compiled: bool, 2025-05-07T20:31:43.9659415Z ) -> None: 2025-05-07T20:31:43.9659640Z torch.manual_seed(2025) 2025-05-07T20:31:43.9659874Z 2025-05-07T20:31:43.9660149Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.9660492Z 2025-05-07T20:31:43.9660685Z x_sign = torch.sign(x) 2025-05-07T20:31:43.9660978Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.9661285Z x = x_sign * x_clamp 2025-05-07T20:31:43.9661525Z x0 = x[:, :D] 2025-05-07T20:31:43.9661745Z x1 = x[:, D:] 2025-05-07T20:31:43.9661963Z 2025-05-07T20:31:43.9662142Z if contiguous: 2025-05-07T20:31:43.9662378Z x0 = x0.contiguous() 2025-05-07T20:31:43.9662639Z x1 = x1.contiguous() 2025-05-07T20:31:43.9662867Z 2025-05-07T20:31:43.9663059Z if scale_ub is not None: 2025-05-07T20:31:43.9663342Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.9663669Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.9663987Z ) 2025-05-07T20:31:43.9664189Z else: 2025-05-07T20:31:43.9664400Z scale_ub_tensor = None 2025-05-07T20:31:43.9664646Z 2025-05-07T20:31:43.9664876Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.9665182Z op = silu_mul_quant 2025-05-07T20:31:43.9665427Z if compiled: 2025-05-07T20:31:43.9665675Z op = torch.compile(op) 2025-05-07T20:31:43.9665971Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.9666329Z 2025-05-07T20:31:43.9666525Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.9666685Z 2025-05-07T20:31:43.9666788Z moe/activation_test.py:117: 2025-05-07T20:31:43.9667082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.9667407Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.9667694Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.9668378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.9669064Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.9669600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.9670278Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.9670980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.9671515Z kernel = self.compile( 2025-05-07T20:31:43.9672056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.9672700Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.9673100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.9673332Z 2025-05-07T20:31:43.9673535Z self = 2025-05-07T20:31:43.9674692Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.9676051Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb971d00>} 2025-05-07T20:31:43.9677387Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.9678404Z context = 2025-05-07T20:31:43.9678692Z 2025-05-07T20:31:43.9678853Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.9679374Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.9679826Z module_map=module_map) 2025-05-07T20:31:43.9680185Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.9680535Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.9680790Z E ^ 2025-05-07T20:31:43.9681300Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.9681757Z 2025-05-07T20:31:43.9682170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.9682675Z 2025-05-07T20:31:44.3148729Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.3149404Z self=, 2025-05-07T20:31:44.3149967Z T=4096, 2025-05-07T20:31:44.3150236Z D=7168, 2025-05-07T20:31:44.3150706Z scale_ub=1200.0, 2025-05-07T20:31:44.3151180Z contiguous=False, 2025-05-07T20:31:44.3151628Z compiled=True, 2025-05-07T20:31:44.3152038Z ) 2025-05-07T20:31:44.3152664Z self = 2025-05-07T20:31:44.3153653Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:44.3154197Z 2025-05-07T20:31:44.3154367Z @given( 2025-05-07T20:31:44.3154823Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.3155778Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.3156385Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.3157042Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.3157689Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.3158256Z ) 2025-05-07T20:31:44.3158951Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.3159817Z def test_silu_mul_quant( 2025-05-07T20:31:44.3160301Z self, 2025-05-07T20:31:44.3160631Z T: int, 2025-05-07T20:31:44.3160852Z D: int, 2025-05-07T20:31:44.3161096Z scale_ub: Optional[float], 2025-05-07T20:31:44.3161368Z contiguous: bool, 2025-05-07T20:31:44.3161605Z compiled: bool, 2025-05-07T20:31:44.3161833Z ) -> None: 2025-05-07T20:31:44.3162067Z torch.manual_seed(2025) 2025-05-07T20:31:44.3162311Z 2025-05-07T20:31:44.3162583Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.3162938Z 2025-05-07T20:31:44.3163133Z x_sign = torch.sign(x) 2025-05-07T20:31:44.3163422Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.3163734Z x = x_sign * x_clamp 2025-05-07T20:31:44.3163976Z x0 = x[:, :D] 2025-05-07T20:31:44.3164195Z x1 = x[:, D:] 2025-05-07T20:31:44.3164401Z 2025-05-07T20:31:44.3164590Z if contiguous: 2025-05-07T20:31:44.3164823Z x0 = x0.contiguous() 2025-05-07T20:31:44.3165192Z x1 = x1.contiguous() 2025-05-07T20:31:44.3165438Z 2025-05-07T20:31:44.3165633Z if scale_ub is not None: 2025-05-07T20:31:44.3165901Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.3166241Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.3166554Z ) 2025-05-07T20:31:44.3166746Z else: 2025-05-07T20:31:44.3166961Z scale_ub_tensor = None 2025-05-07T20:31:44.3167215Z 2025-05-07T20:31:44.3167445Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.3167873Z op = silu_mul_quant 2025-05-07T20:31:44.3168127Z if compiled: 2025-05-07T20:31:44.3168381Z op = torch.compile(op) 2025-05-07T20:31:44.3168673Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.3168952Z 2025-05-07T20:31:44.3169152Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.3169316Z 2025-05-07T20:31:44.3169417Z moe/activation_test.py:117: 2025-05-07T20:31:44.3169719Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.3170051Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.3177162Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.3177876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:44.3178449Z return fn(*args, **kwargs) 
2025-05-07T20:31:44.3179259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.3179993Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.3180530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.3181219Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.3181888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.3182424Z kernel = self.compile( 2025-05-07T20:31:44.3182966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.3183628Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.3184034Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.3184265Z 2025-05-07T20:31:44.3184603Z self = 2025-05-07T20:31:44.3185681Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.3187058Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c852ccc0>} 2025-05-07T20:31:44.3188401Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.3189423Z context = 2025-05-07T20:31:44.3189711Z 2025-05-07T20:31:44.3189889Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.3190416Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.3190893Z module_map=module_map) 2025-05-07T20:31:44.3191267Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.3191624Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.3191889Z E ^ 2025-05-07T20:31:44.3192437Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.3192888Z 2025-05-07T20:31:44.3193309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.3193821Z 2025-05-07T20:31:44.3193930Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.3194347Z self=, 2025-05-07T20:31:44.3194755Z T=128, 2025-05-07T20:31:44.3194945Z D=7168, 2025-05-07T20:31:44.3195156Z scale_ub=1200.0, 2025-05-07T20:31:44.3195387Z contiguous=False, 2025-05-07T20:31:44.3195618Z compiled=True, 2025-05-07T20:31:44.3195823Z ) 2025-05-07T20:31:44.3894282Z self = 2025-05-07T20:31:44.3895021Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:44.3895431Z 2025-05-07T20:31:44.3895550Z @given( 2025-05-07T20:31:44.3895855Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.3896264Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.3896572Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.3896902Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.3897232Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.3897522Z ) 2025-05-07T20:31:44.3897870Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.3898310Z def test_silu_mul_quant( 2025-05-07T20:31:44.3898560Z self, 2025-05-07T20:31:44.3898759Z T: int, 2025-05-07T20:31:44.3898951Z D: int, 2025-05-07T20:31:44.3899171Z scale_ub: Optional[float], 2025-05-07T20:31:44.3899439Z contiguous: bool, 2025-05-07T20:31:44.3899677Z compiled: bool, 2025-05-07T20:31:44.3899905Z ) -> None: 2025-05-07T20:31:44.3900119Z torch.manual_seed(2025) 2025-05-07T20:31:44.3900359Z 2025-05-07T20:31:44.3900636Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.3900983Z 2025-05-07T20:31:44.3901178Z x_sign = torch.sign(x) 2025-05-07T20:31:44.3901471Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.3901781Z x = x_sign * x_clamp 2025-05-07T20:31:44.3902019Z x0 = x[:, :D] 2025-05-07T20:31:44.3902244Z x1 = x[:, D:] 2025-05-07T20:31:44.3902455Z 2025-05-07T20:31:44.3902649Z if contiguous: 2025-05-07T20:31:44.3902884Z x0 = x0.contiguous() 2025-05-07T20:31:44.3903326Z x1 = x1.contiguous() 2025-05-07T20:31:44.3903570Z 2025-05-07T20:31:44.3903760Z if scale_ub is not None: 2025-05-07T20:31:44.3904039Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.3904375Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.3904686Z ) 2025-05-07T20:31:44.3904883Z else: 2025-05-07T20:31:44.3905101Z scale_ub_tensor = None 2025-05-07T20:31:44.3905352Z 2025-05-07T20:31:44.3905762Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.3906088Z op = silu_mul_quant 2025-05-07T20:31:44.3906335Z if compiled: 2025-05-07T20:31:44.3906587Z op = torch.compile(op) 2025-05-07T20:31:44.3906885Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.3907157Z 2025-05-07T20:31:44.3907351Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.3907518Z 2025-05-07T20:31:44.3907627Z moe/activation_test.py:117: 2025-05-07T20:31:44.3907925Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.3908256Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.3908541Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.3909098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:44.3909651Z return fn(*args, **kwargs) 
2025-05-07T20:31:44.3910435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.3911129Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.3911663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.3912337Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.3912994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.3913531Z kernel = self.compile( 2025-05-07T20:31:44.3914064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.3914719Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.3915113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.3915341Z 2025-05-07T20:31:44.3915557Z self = 2025-05-07T20:31:44.3916623Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.3917981Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c852d580>} 2025-05-07T20:31:44.3919324Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.3920339Z context = 2025-05-07T20:31:44.3920626Z 2025-05-07T20:31:44.3920791Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.3921316Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.3921782Z module_map=module_map) 2025-05-07T20:31:44.3922146Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.3922500Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.3922765Z E ^ 2025-05-07T20:31:44.3923233Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.3923808Z 2025-05-07T20:31:44.3924220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.3924729Z 2025-05-07T20:31:44.3924836Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.3925243Z self=, 2025-05-07T20:31:44.3925644Z T=2048, 2025-05-07T20:31:44.3925838Z D=7168, 2025-05-07T20:31:44.3926032Z scale_ub=None, 2025-05-07T20:31:44.3926249Z contiguous=True, 2025-05-07T20:31:44.3926473Z compiled=True, 2025-05-07T20:31:44.3926674Z ) 2025-05-07T20:31:44.3926992Z self = 2025-05-07T20:31:44.3927479Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:44.3927803Z 2025-05-07T20:31:44.3927886Z @given( 2025-05-07T20:31:44.3928109Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.3928428Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.3928732Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.3929056Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.3929385Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.3929673Z ) 2025-05-07T20:31:44.3930015Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.3930927Z def test_silu_mul_quant( 2025-05-07T20:31:44.3931178Z self, 2025-05-07T20:31:44.3931372Z T: int, 2025-05-07T20:31:44.3931570Z D: int, 2025-05-07T20:31:44.3931786Z scale_ub: Optional[float], 2025-05-07T20:31:44.3932057Z contiguous: bool, 2025-05-07T20:31:44.3932312Z compiled: bool, 2025-05-07T20:31:44.3932536Z ) -> None: 2025-05-07T20:31:44.3932754Z torch.manual_seed(2025) 2025-05-07T20:31:44.3932989Z 2025-05-07T20:31:44.3933265Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.3933608Z 2025-05-07T20:31:44.3933802Z x_sign = torch.sign(x) 2025-05-07T20:31:44.3934095Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.3934406Z x = x_sign * x_clamp 2025-05-07T20:31:44.3934656Z x0 = x[:, :D] 2025-05-07T20:31:44.3934873Z x1 = x[:, D:] 2025-05-07T20:31:44.3935083Z 2025-05-07T20:31:44.3935272Z if contiguous: 2025-05-07T20:31:44.3935508Z x0 = x0.contiguous() 2025-05-07T20:31:44.3935765Z x1 = x1.contiguous() 2025-05-07T20:31:44.3936002Z 2025-05-07T20:31:44.3936190Z if scale_ub is not None: 2025-05-07T20:31:44.3936460Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.3936791Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.3937097Z ) 2025-05-07T20:31:44.3937293Z else: 2025-05-07T20:31:44.3937506Z scale_ub_tensor = None 2025-05-07T20:31:44.3937759Z 2025-05-07T20:31:44.3937991Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.3938306Z op = silu_mul_quant 2025-05-07T20:31:44.3938552Z if compiled: 2025-05-07T20:31:44.3938799Z op = torch.compile(op) 2025-05-07T20:31:44.3939094Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.3939369Z 2025-05-07T20:31:44.3939559Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.3939728Z 2025-05-07T20:31:44.3939833Z moe/activation_test.py:117: 2025-05-07T20:31:44.3940129Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.3940457Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.3940737Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.3941292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:44.3941844Z return fn(*args, **kwargs) 
2025-05-07T20:31:44.3942587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.3943276Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.3943806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.3944477Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.3945138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.3945664Z kernel = self.compile( 2025-05-07T20:31:44.3946199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.3946842Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.3947236Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.3947466Z 2025-05-07T20:31:44.3947676Z self = 2025-05-07T20:31:44.3948742Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.3950210Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c852e480>} 2025-05-07T20:31:44.3951546Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.3952563Z context = 2025-05-07T20:31:44.3952847Z 2025-05-07T20:31:44.3953014Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.3953539Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.3954005Z module_map=module_map) 2025-05-07T20:31:44.3954367Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.3954716Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.3954976Z E ^ 2025-05-07T20:31:44.3955446Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.3955892Z 2025-05-07T20:31:44.3956310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.3956814Z 2025-05-07T20:31:44.4569704Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.4570213Z self=, 2025-05-07T20:31:44.4570760Z T=16384, 2025-05-07T20:31:44.4571073Z D=5120, 2025-05-07T20:31:44.4571400Z scale_ub=None, 2025-05-07T20:31:44.4571703Z contiguous=False, 2025-05-07T20:31:44.4572006Z compiled=False, 2025-05-07T20:31:44.4572296Z ) 2025-05-07T20:31:44.4572628Z self = 2025-05-07T20:31:44.4573124Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:44.4573415Z 2025-05-07T20:31:44.4573498Z @given( 2025-05-07T20:31:44.4573737Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.4574057Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.4574356Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.4574692Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.4575023Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.4575311Z ) 2025-05-07T20:31:44.4575663Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.4576284Z def test_silu_mul_quant( 2025-05-07T20:31:44.4576531Z self, 2025-05-07T20:31:44.4576730Z T: int, 2025-05-07T20:31:44.4576936Z D: int, 2025-05-07T20:31:44.4577147Z scale_ub: Optional[float], 2025-05-07T20:31:44.4577425Z contiguous: bool, 2025-05-07T20:31:44.4577666Z compiled: bool, 2025-05-07T20:31:44.4577895Z ) -> None: 2025-05-07T20:31:44.4578113Z torch.manual_seed(2025) 2025-05-07T20:31:44.4578367Z 2025-05-07T20:31:44.4578643Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.4578998Z 2025-05-07T20:31:44.4579198Z x_sign = torch.sign(x) 2025-05-07T20:31:44.4579500Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.4581531Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
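This first OutOfMemoryError follows directly from the test's allocation pattern: x is a [T, 2*D] bfloat16 tensor, so T=16384, D=5120 needs 16384 * 10240 * 2 bytes = 320 MiB, exactly the failed request, and the free-memory figures reported across examples keep shrinking, which suggests tensors from earlier failing examples are not being released. Two mitigations worth noting, as a sketch only: the allocator's own suggestion of expandable segments (which only takes effect if set before CUDA is initialized), and explicitly returning cached blocks between examples. Both calls below are standard PyTorch/CPython APIs; wiring them into the Hypothesis test is an assumption, not something the suite does today:

    import os
    # The allocator's own suggestion from the error message; it must be set
    # before the first CUDA allocation in the process to have any effect.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import gc
    import torch

    def release_cuda_memory() -> None:
        # Hypothetical per-example cleanup: drop dead references, then hand
        # cached blocks back so the next example starts from a clean pool.
        gc.collect()
        torch.cuda.empty_cache()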
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.4583420Z 2025-05-07T20:31:44.4583548Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:44.4583764Z 2025-05-07T20:31:44.4583993Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.4584417Z self=, 2025-05-07T20:31:44.4584824Z T=4096, 2025-05-07T20:31:44.4585013Z D=7168, 2025-05-07T20:31:44.4585212Z scale_ub=1200.0, 2025-05-07T20:31:44.4585435Z contiguous=True, 2025-05-07T20:31:44.4585654Z compiled=True, 2025-05-07T20:31:44.4585860Z ) 2025-05-07T20:31:44.4586180Z self = 2025-05-07T20:31:44.4586682Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:44.4586953Z 2025-05-07T20:31:44.4587034Z @given( 2025-05-07T20:31:44.4587265Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.4587583Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.4587886Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.4588215Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.4588555Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.4588837Z ) 2025-05-07T20:31:44.4589189Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.4589630Z def test_silu_mul_quant( 2025-05-07T20:31:44.4589873Z self, 2025-05-07T20:31:44.4590066Z T: int, 2025-05-07T20:31:44.4590267Z D: int, 2025-05-07T20:31:44.4590490Z scale_ub: Optional[float], 2025-05-07T20:31:44.4590760Z contiguous: bool, 2025-05-07T20:31:44.4591001Z compiled: bool, 2025-05-07T20:31:44.4591226Z ) -> None: 2025-05-07T20:31:44.4591442Z torch.manual_seed(2025) 2025-05-07T20:31:44.4591682Z 2025-05-07T20:31:44.4591955Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.4592290Z 2025-05-07T20:31:44.4592491Z x_sign = torch.sign(x) 2025-05-07T20:31:44.4592779Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.4594794Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.4596817Z 2025-05-07T20:31:44.4596942Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:44.4597154Z 2025-05-07T20:31:44.4597258Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.4597669Z self=, 2025-05-07T20:31:44.4598072Z T=16384, 2025-05-07T20:31:44.4598264Z D=7168, 2025-05-07T20:31:44.4598462Z scale_ub=None, 2025-05-07T20:31:44.4598686Z contiguous=False, 2025-05-07T20:31:44.4598907Z compiled=False, 2025-05-07T20:31:44.4599115Z ) 2025-05-07T20:31:44.4599435Z self = 2025-05-07T20:31:44.4599930Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:44.4600213Z 2025-05-07T20:31:44.4600292Z @given( 2025-05-07T20:31:44.4600520Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.4600846Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.4601150Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.4601482Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.4601813Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.4602096Z ) 2025-05-07T20:31:44.4602448Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.4602891Z def test_silu_mul_quant( 2025-05-07T20:31:44.4603214Z self, 2025-05-07T20:31:44.4603417Z T: int, 2025-05-07T20:31:44.4603622Z D: int, 2025-05-07T20:31:44.4603835Z scale_ub: Optional[float], 2025-05-07T20:31:44.4604106Z contiguous: bool, 2025-05-07T20:31:44.4604369Z compiled: bool, 2025-05-07T20:31:44.4604595Z ) -> None: 2025-05-07T20:31:44.4604806Z torch.manual_seed(2025) 2025-05-07T20:31:44.4605049Z 2025-05-07T20:31:44.4605323Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.4607663Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.4609544Z 2025-05-07T20:31:44.4609672Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:44.4609881Z 2025-05-07T20:31:44.4609985Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.4610405Z self=, 2025-05-07T20:31:44.4610806Z T=2048, 2025-05-07T20:31:44.4611001Z D=7168, 2025-05-07T20:31:44.4611195Z scale_ub=1200.0, 2025-05-07T20:31:44.4611425Z contiguous=True, 2025-05-07T20:31:44.4611642Z compiled=True, 2025-05-07T20:31:44.4611850Z ) 2025-05-07T20:31:44.4612167Z self = 2025-05-07T20:31:44.4612658Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:44.4612934Z 2025-05-07T20:31:44.4613009Z @given( 2025-05-07T20:31:44.4613243Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.4613562Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.4613866Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.4614194Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.4614520Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.4614807Z ) 2025-05-07T20:31:44.4615157Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.4615730Z def test_silu_mul_quant( 2025-05-07T20:31:44.4615968Z self, 2025-05-07T20:31:44.4616167Z T: int, 2025-05-07T20:31:44.4616369Z D: int, 2025-05-07T20:31:44.4616586Z scale_ub: Optional[float], 2025-05-07T20:31:44.4616859Z contiguous: bool, 2025-05-07T20:31:44.4617099Z compiled: bool, 2025-05-07T20:31:44.4617325Z ) -> None: 2025-05-07T20:31:44.4617537Z torch.manual_seed(2025) 2025-05-07T20:31:44.4617780Z 2025-05-07T20:31:44.4618055Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.4618402Z 2025-05-07T20:31:44.4618598Z x_sign = torch.sign(x) 2025-05-07T20:31:44.4618888Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.4620889Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.4622757Z 2025-05-07T20:31:44.4622879Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:44.4623094Z 2025-05-07T20:31:44.4623310Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.4623726Z self=, 2025-05-07T20:31:44.4624136Z T=2048, 2025-05-07T20:31:44.4624321Z D=7168, 2025-05-07T20:31:44.4624515Z scale_ub=None, 2025-05-07T20:31:44.4624729Z contiguous=True, 2025-05-07T20:31:44.4624944Z compiled=False, 2025-05-07T20:31:44.4625155Z ) 2025-05-07T20:31:44.5482884Z self = 2025-05-07T20:31:44.5484284Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:44.5484963Z 2025-05-07T20:31:44.5485159Z @given( 2025-05-07T20:31:44.5485719Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.5486469Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.5487131Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.5487862Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.5488474Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.5488998Z ) 2025-05-07T20:31:44.5489630Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.5490438Z def test_silu_mul_quant( 2025-05-07T20:31:44.5490869Z self, 2025-05-07T20:31:44.5491100Z T: int, 2025-05-07T20:31:44.5491305Z D: int, 2025-05-07T20:31:44.5491532Z scale_ub: Optional[float], 2025-05-07T20:31:44.5491801Z contiguous: bool, 2025-05-07T20:31:44.5492052Z compiled: bool, 2025-05-07T20:31:44.5492284Z ) -> None: 2025-05-07T20:31:44.5492499Z torch.manual_seed(2025) 2025-05-07T20:31:44.5492752Z 2025-05-07T20:31:44.5493028Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.5493370Z 2025-05-07T20:31:44.5493573Z > x_sign = torch.sign(x) 2025-05-07T20:31:44.5495522Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
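Since the parameter grid itself determines peak memory, another hedged option is to bound the search space with hypothesis.assume, so oversized T-by-D combinations are discarded instead of crashing the worker. The 1 GiB cap below is an arbitrary illustrative bound, not a value from the test suite:

    import torch
    from hypothesis import assume, given, settings
    from hypothesis import strategies as st

    MAX_INPUT_BYTES = 1 << 30  # illustrative 1 GiB cap on the bf16 input

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @settings(deadline=None)
    def test_silu_mul_quant_bounded(T: int, D: int) -> None:
        # x is [T, 2 * D] bfloat16, i.e. T * 2 * D * 2 bytes; for example
        # 2048 x 14336 bf16 is 56 MiB, matching the failed request just above.
        assume(T * 2 * D * 2 <= MAX_INPUT_BYTES)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
        del x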
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.5497594Z 2025-05-07T20:31:44.5497715Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:44.5497928Z 2025-05-07T20:31:44.5498041Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.5498454Z self=, 2025-05-07T20:31:44.5498874Z T=1, 2025-05-07T20:31:44.5499066Z D=7168, 2025-05-07T20:31:44.5499270Z scale_ub=1200.0, 2025-05-07T20:31:44.5499492Z contiguous=True, 2025-05-07T20:31:44.5499722Z compiled=False, 2025-05-07T20:31:44.5499942Z ) 2025-05-07T20:31:44.5500257Z self = 2025-05-07T20:31:44.5508554Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:44.5508836Z 2025-05-07T20:31:44.5508926Z @given( 2025-05-07T20:31:44.5509163Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.5509482Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.5509800Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.5510154Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.5510496Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.5510797Z ) 2025-05-07T20:31:44.5511159Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.5511612Z def test_silu_mul_quant( 2025-05-07T20:31:44.5511866Z self, 2025-05-07T20:31:44.5512079Z T: int, 2025-05-07T20:31:44.5512291Z D: int, 2025-05-07T20:31:44.5512689Z scale_ub: Optional[float], 2025-05-07T20:31:44.5512979Z contiguous: bool, 2025-05-07T20:31:44.5513238Z compiled: bool, 2025-05-07T20:31:44.5513470Z ) -> None: 2025-05-07T20:31:44.5513701Z torch.manual_seed(2025) 2025-05-07T20:31:44.5513956Z 2025-05-07T20:31:44.5514236Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.5514596Z 2025-05-07T20:31:44.5514805Z x_sign = torch.sign(x) 2025-05-07T20:31:44.5515107Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.5515435Z x = x_sign * x_clamp 2025-05-07T20:31:44.5515690Z x0 = x[:, :D] 2025-05-07T20:31:44.5515925Z x1 = x[:, D:] 2025-05-07T20:31:44.5516145Z 2025-05-07T20:31:44.5516347Z if contiguous: 2025-05-07T20:31:44.5516597Z x0 = x0.contiguous() 2025-05-07T20:31:44.5516865Z x1 = x1.contiguous() 2025-05-07T20:31:44.5517115Z 2025-05-07T20:31:44.5517330Z if scale_ub is not None: 2025-05-07T20:31:44.5517611Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.5517965Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.5518283Z ) 2025-05-07T20:31:44.5518487Z else: 2025-05-07T20:31:44.5518713Z scale_ub_tensor = None 2025-05-07T20:31:44.5518980Z 2025-05-07T20:31:44.5519223Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.5519555Z op = silu_mul_quant 2025-05-07T20:31:44.5519818Z if compiled: 2025-05-07T20:31:44.5520075Z op = torch.compile(op) 2025-05-07T20:31:44.5520384Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.5520670Z 2025-05-07T20:31:44.5520874Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.5521043Z 2025-05-07T20:31:44.5521151Z moe/activation_test.py:117: 2025-05-07T20:31:44.5521457Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.5521816Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.5522103Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.5522808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.5523514Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.5524064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.5524879Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.5525552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.5526105Z kernel = self.compile( 2025-05-07T20:31:44.5526654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.5527326Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.5527787Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.5528019Z 2025-05-07T20:31:44.5528237Z self = 2025-05-07T20:31:44.5529322Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.5530735Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb759d00>} 2025-05-07T20:31:44.5532113Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.5533234Z context = 2025-05-07T20:31:44.5533529Z 2025-05-07T20:31:44.5533707Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.5534238Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.5534718Z module_map=module_map) 2025-05-07T20:31:44.5535095Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.5535461Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.5535908Z E ^ 2025-05-07T20:31:44.5536378Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.5536828Z 2025-05-07T20:31:44.5537252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.5537763Z 2025-05-07T20:31:44.5537872Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.5538324Z self=, 2025-05-07T20:31:44.5538807Z T=128, 2025-05-07T20:31:44.5539001Z D=5120, 2025-05-07T20:31:44.5539273Z scale_ub=None, 2025-05-07T20:31:44.5539496Z contiguous=True, 2025-05-07T20:31:44.5539716Z compiled=False, 2025-05-07T20:31:44.5539921Z ) 2025-05-07T20:31:44.6064814Z self = 2025-05-07T20:31:44.6065583Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:44.6065960Z 2025-05-07T20:31:44.6066078Z @given( 2025-05-07T20:31:44.6066389Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.6066812Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.6067132Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.6067468Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.6067804Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.6068104Z ) 2025-05-07T20:31:44.6068457Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.6068903Z def test_silu_mul_quant( 2025-05-07T20:31:44.6069151Z self, 2025-05-07T20:31:44.6069352Z T: int, 2025-05-07T20:31:44.6069560Z D: int, 2025-05-07T20:31:44.6069784Z scale_ub: Optional[float], 2025-05-07T20:31:44.6070062Z contiguous: bool, 2025-05-07T20:31:44.6070484Z compiled: bool, 2025-05-07T20:31:44.6070715Z ) -> None: 2025-05-07T20:31:44.6070939Z torch.manual_seed(2025) 2025-05-07T20:31:44.6071183Z 2025-05-07T20:31:44.6071466Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.6071818Z 2025-05-07T20:31:44.6072016Z x_sign = torch.sign(x) 2025-05-07T20:31:44.6072312Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.6072628Z x = x_sign * x_clamp 2025-05-07T20:31:44.6072878Z x0 = x[:, :D] 2025-05-07T20:31:44.6073110Z x1 = x[:, D:] 2025-05-07T20:31:44.6073329Z 2025-05-07T20:31:44.6073524Z if contiguous: 2025-05-07T20:31:44.6073763Z x0 = x0.contiguous() 2025-05-07T20:31:44.6074028Z x1 = x1.contiguous() 2025-05-07T20:31:44.6074275Z 2025-05-07T20:31:44.6074469Z if scale_ub is not None: 2025-05-07T20:31:44.6074747Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.6075089Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.6075403Z ) 2025-05-07T20:31:44.6075608Z else: 2025-05-07T20:31:44.6075826Z scale_ub_tensor = None 2025-05-07T20:31:44.6076080Z 2025-05-07T20:31:44.6076312Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.6076632Z op = silu_mul_quant 2025-05-07T20:31:44.6076891Z if compiled: 2025-05-07T20:31:44.6077140Z op = torch.compile(op) 2025-05-07T20:31:44.6077555Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.6077841Z 2025-05-07T20:31:44.6078036Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.6078206Z 2025-05-07T20:31:44.6078309Z moe/activation_test.py:117: 2025-05-07T20:31:44.6078610Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.6078940Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.6079224Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.6079916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.6080610Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.6081143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.6081825Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.6082493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.6083022Z kernel = self.compile( 2025-05-07T20:31:44.6083563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.6084215Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.6084612Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.6084848Z 2025-05-07T20:31:44.6085055Z self = 2025-05-07T20:31:44.6086132Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.6087616Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb75ae80>} 2025-05-07T20:31:44.6088957Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.6089978Z context = 2025-05-07T20:31:44.6090265Z 2025-05-07T20:31:44.6090429Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.6091044Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.6091562Z module_map=module_map) 2025-05-07T20:31:44.6091926Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.6092284Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.6092547Z E ^ 2025-05-07T20:31:44.6093018Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.6093465Z 2025-05-07T20:31:44.6093877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.6094389Z 2025-05-07T20:31:44.6094494Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.6094910Z self=, 2025-05-07T20:31:44.6095313Z T=128, 2025-05-07T20:31:44.6095509Z D=7168, 2025-05-07T20:31:44.6095713Z scale_ub=None, 2025-05-07T20:31:44.6095934Z contiguous=True, 2025-05-07T20:31:44.6096157Z compiled=False, 2025-05-07T20:31:44.6096368Z ) 2025-05-07T20:31:44.6096689Z self = 2025-05-07T20:31:44.6097206Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:44.6097476Z 2025-05-07T20:31:44.6097559Z @given( 2025-05-07T20:31:44.6097876Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.6098191Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.6098492Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.6098824Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.6099154Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.6099435Z ) 2025-05-07T20:31:44.6099784Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.6100230Z def test_silu_mul_quant( 2025-05-07T20:31:44.6100474Z self, 2025-05-07T20:31:44.6100667Z T: int, 2025-05-07T20:31:44.6100868Z D: int, 2025-05-07T20:31:44.6101119Z scale_ub: Optional[float], 2025-05-07T20:31:44.6101405Z contiguous: bool, 2025-05-07T20:31:44.6101645Z compiled: bool, 2025-05-07T20:31:44.6101868Z ) -> None: 2025-05-07T20:31:44.6102086Z torch.manual_seed(2025) 2025-05-07T20:31:44.6102331Z 2025-05-07T20:31:44.6102610Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.6102951Z 2025-05-07T20:31:44.6103152Z x_sign = torch.sign(x) 2025-05-07T20:31:44.6103448Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.6103752Z x = x_sign * x_clamp 2025-05-07T20:31:44.6103996Z x0 = x[:, :D] 2025-05-07T20:31:44.6104217Z x1 = x[:, D:] 2025-05-07T20:31:44.6104421Z 2025-05-07T20:31:44.6104616Z if contiguous: 2025-05-07T20:31:44.6104859Z x0 = x0.contiguous() 2025-05-07T20:31:44.6105119Z x1 = x1.contiguous() 2025-05-07T20:31:44.6105352Z 2025-05-07T20:31:44.6105549Z if scale_ub is not None: 2025-05-07T20:31:44.6106153Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.6106486Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.6106798Z ) 2025-05-07T20:31:44.6106993Z else: 2025-05-07T20:31:44.6107208Z scale_ub_tensor = None 2025-05-07T20:31:44.6107461Z 2025-05-07T20:31:44.6107698Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.6108012Z op = silu_mul_quant 2025-05-07T20:31:44.6108264Z if compiled: 2025-05-07T20:31:44.6108515Z op = torch.compile(op) 2025-05-07T20:31:44.6108804Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.6109085Z 2025-05-07T20:31:44.6109281Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.6109619Z 2025-05-07T20:31:44.6109730Z moe/activation_test.py:117: 2025-05-07T20:31:44.6110026Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.6110359Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.6110638Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.6111320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.6112012Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.6112544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.6113223Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.6113880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.6114411Z kernel = self.compile( 2025-05-07T20:31:44.6114956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.6115602Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.6115995Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.6116229Z 2025-05-07T20:31:44.6116435Z self = 2025-05-07T20:31:44.6117626Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.6118983Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb75bec0>} 2025-05-07T20:31:44.6120315Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.6121393Z context = 2025-05-07T20:31:44.6121683Z 2025-05-07T20:31:44.6121852Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.6122376Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.6122842Z module_map=module_map) 2025-05-07T20:31:44.6123207Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.6123562Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.6123822Z E ^ 2025-05-07T20:31:44.6124290Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.6124736Z 2025-05-07T20:31:44.6125156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.6125671Z 2025-05-07T20:31:44.6125783Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.6126188Z self=, 2025-05-07T20:31:44.6126592Z T=2048, 2025-05-07T20:31:44.6126787Z D=7168, 2025-05-07T20:31:44.6126983Z scale_ub=1200.0, 2025-05-07T20:31:44.6127208Z contiguous=True, 2025-05-07T20:31:44.6127433Z compiled=False, 2025-05-07T20:31:44.6127686Z ) 2025-05-07T20:31:44.6786248Z self = 2025-05-07T20:31:44.6786990Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:44.6787384Z 2025-05-07T20:31:44.6787492Z @given( 2025-05-07T20:31:44.6787801Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.6788220Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.6788644Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.6789167Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.6789493Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.6789786Z ) 2025-05-07T20:31:44.6790132Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.6790568Z def test_silu_mul_quant( 2025-05-07T20:31:44.6790810Z self, 2025-05-07T20:31:44.6791013Z T: int, 2025-05-07T20:31:44.6791216Z D: int, 2025-05-07T20:31:44.6791439Z scale_ub: Optional[float], 2025-05-07T20:31:44.6791710Z contiguous: bool, 2025-05-07T20:31:44.6791953Z compiled: bool, 2025-05-07T20:31:44.6792181Z ) -> None: 2025-05-07T20:31:44.6792400Z torch.manual_seed(2025) 2025-05-07T20:31:44.6792646Z 2025-05-07T20:31:44.6792913Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.6794957Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.6796929Z 2025-05-07T20:31:44.6797050Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:44.6797262Z 2025-05-07T20:31:44.6797371Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.6797777Z self=, 2025-05-07T20:31:44.6798184Z T=1, 2025-05-07T20:31:44.6798377Z D=5120, 2025-05-07T20:31:44.6798577Z scale_ub=1200.0, 2025-05-07T20:31:44.6798797Z contiguous=True, 2025-05-07T20:31:44.6799030Z compiled=False, 2025-05-07T20:31:44.6799236Z ) 2025-05-07T20:31:44.6799554Z self = 2025-05-07T20:31:44.6800043Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:44.6800308Z 2025-05-07T20:31:44.6800395Z @given( 2025-05-07T20:31:44.6800624Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.6800940Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.6801255Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.6801581Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.6801904Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.6802193Z ) 2025-05-07T20:31:44.6802548Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.6802991Z def test_silu_mul_quant( 2025-05-07T20:31:44.6803233Z self, 2025-05-07T20:31:44.6803437Z T: int, 2025-05-07T20:31:44.6803637Z D: int, 2025-05-07T20:31:44.6803849Z scale_ub: Optional[float], 2025-05-07T20:31:44.6804124Z contiguous: bool, 2025-05-07T20:31:44.6804367Z compiled: bool, 2025-05-07T20:31:44.6804591Z ) -> None: 2025-05-07T20:31:44.6804807Z torch.manual_seed(2025) 2025-05-07T20:31:44.6805048Z 2025-05-07T20:31:44.6805326Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.6805923Z 2025-05-07T20:31:44.6806133Z x_sign = torch.sign(x) 2025-05-07T20:31:44.6806430Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.6806738Z x = x_sign * x_clamp 2025-05-07T20:31:44.6806984Z x0 = x[:, :D] 2025-05-07T20:31:44.6807203Z x1 = x[:, D:] 2025-05-07T20:31:44.6807411Z 2025-05-07T20:31:44.6807679Z if contiguous: 2025-05-07T20:31:44.6807913Z x0 = x0.contiguous() 2025-05-07T20:31:44.6808166Z x1 = x1.contiguous() 2025-05-07T20:31:44.6808565Z 2025-05-07T20:31:44.6808760Z if scale_ub is not None: 2025-05-07T20:31:44.6809030Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.6809369Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.6809682Z ) 2025-05-07T20:31:44.6809881Z else: 2025-05-07T20:31:44.6810098Z scale_ub_tensor = None 2025-05-07T20:31:44.6810354Z 2025-05-07T20:31:44.6810589Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.6810907Z op = silu_mul_quant 2025-05-07T20:31:44.6811161Z if compiled: 2025-05-07T20:31:44.6811411Z op = torch.compile(op) 2025-05-07T20:31:44.6811703Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.6811985Z 2025-05-07T20:31:44.6812189Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.6812352Z 2025-05-07T20:31:44.6812453Z moe/activation_test.py:117: 2025-05-07T20:31:44.6812751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.6813092Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.6813370Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.6814057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.6814751Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.6815401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.6816080Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.6816741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.6817273Z kernel = self.compile( 2025-05-07T20:31:44.6817813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.6818465Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.6818880Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.6819111Z 2025-05-07T20:31:44.6819316Z self = 2025-05-07T20:31:44.6820398Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.6821757Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb85d4e0>} 2025-05-07T20:31:44.6823095Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.6824121Z context = 2025-05-07T20:31:44.6824410Z 2025-05-07T20:31:44.6824581Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.6825104Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.6825565Z module_map=module_map) 2025-05-07T20:31:44.6825932Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.6826285Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.6826543Z E ^ 2025-05-07T20:31:44.6827008Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.6827453Z 2025-05-07T20:31:44.6827868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.6828375Z 2025-05-07T20:31:44.6828573Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.6828981Z self=, 2025-05-07T20:31:44.6829511Z T=2048, 2025-05-07T20:31:44.6829705Z D=5120, 2025-05-07T20:31:44.6829894Z scale_ub=None, 2025-05-07T20:31:44.6830111Z contiguous=True, 2025-05-07T20:31:44.6830338Z compiled=False, 2025-05-07T20:31:44.6830543Z ) 2025-05-07T20:31:44.6830869Z self = 2025-05-07T20:31:44.6831371Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:44.6831642Z 2025-05-07T20:31:44.6831730Z @given( 2025-05-07T20:31:44.6831966Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.6832280Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.6832588Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.6832913Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.6833257Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.6833554Z ) 2025-05-07T20:31:44.6833904Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.6834346Z def test_silu_mul_quant( 2025-05-07T20:31:44.6834598Z self, 2025-05-07T20:31:44.6834790Z T: int, 2025-05-07T20:31:44.6834997Z D: int, 2025-05-07T20:31:44.6835225Z scale_ub: Optional[float], 2025-05-07T20:31:44.6835493Z contiguous: bool, 2025-05-07T20:31:44.6835857Z compiled: bool, 2025-05-07T20:31:44.6836088Z ) -> None: 2025-05-07T20:31:44.6836315Z torch.manual_seed(2025) 2025-05-07T20:31:44.6836560Z 2025-05-07T20:31:44.6836833Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.6837180Z 2025-05-07T20:31:44.6837374Z > x_sign = torch.sign(x) 2025-05-07T20:31:44.6839339Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.6848568Z 2025-05-07T20:31:44.6848758Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:44.6848989Z 2025-05-07T20:31:44.6849103Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.6849525Z self=, 2025-05-07T20:31:44.6849940Z T=16384, 2025-05-07T20:31:44.6850144Z D=5120, 2025-05-07T20:31:44.6850349Z scale_ub=None, 2025-05-07T20:31:44.6850564Z contiguous=True, 2025-05-07T20:31:44.6850786Z compiled=False, 2025-05-07T20:31:44.6851010Z ) 2025-05-07T20:31:44.7543625Z self = 2025-05-07T20:31:44.7544420Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:44.7544804Z 2025-05-07T20:31:44.7544917Z @given( 2025-05-07T20:31:44.7545223Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.7545567Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.7545884Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.7546208Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.7546536Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.7546821Z ) 2025-05-07T20:31:44.7547164Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.7547605Z def test_silu_mul_quant( 2025-05-07T20:31:44.7547849Z self, 2025-05-07T20:31:44.7548044Z T: int, 2025-05-07T20:31:44.7548894Z D: int, 2025-05-07T20:31:44.7549115Z scale_ub: Optional[float], 2025-05-07T20:31:44.7549388Z contiguous: bool, 2025-05-07T20:31:44.7549621Z compiled: bool, 2025-05-07T20:31:44.7549843Z ) -> None: 2025-05-07T20:31:44.7550058Z torch.manual_seed(2025) 2025-05-07T20:31:44.7550297Z 2025-05-07T20:31:44.7550571Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.7552658Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.7554563Z 2025-05-07T20:31:44.7554687Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:44.7554902Z 2025-05-07T20:31:44.7555017Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.7555432Z self=, 2025-05-07T20:31:44.7555841Z T=4096, 2025-05-07T20:31:44.7556036Z D=5120, 2025-05-07T20:31:44.7556235Z scale_ub=None, 2025-05-07T20:31:44.7556454Z contiguous=True, 2025-05-07T20:31:44.7556802Z compiled=False, 2025-05-07T20:31:44.7557016Z ) 2025-05-07T20:31:44.7557334Z self = 2025-05-07T20:31:44.7557835Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:44.7558107Z 2025-05-07T20:31:44.7558197Z @given( 2025-05-07T20:31:44.7558429Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.7558746Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.7559061Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.7559392Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.7559724Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.7560014Z ) 2025-05-07T20:31:44.7560372Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.7560813Z def test_silu_mul_quant( 2025-05-07T20:31:44.7561063Z self, 2025-05-07T20:31:44.7561268Z T: int, 2025-05-07T20:31:44.7561479Z D: int, 2025-05-07T20:31:44.7561704Z scale_ub: Optional[float], 2025-05-07T20:31:44.7561982Z contiguous: bool, 2025-05-07T20:31:44.7562224Z compiled: bool, 2025-05-07T20:31:44.7562453Z ) -> None: 2025-05-07T20:31:44.7562679Z torch.manual_seed(2025) 2025-05-07T20:31:44.7562924Z 2025-05-07T20:31:44.7563199Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.7565271Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

The next eight Hypothesis examples fail identically: each re-runs the listing shown above and raises torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)), with the same device state every time (GPU 0: 22.07 GiB total capacity, 30.44 MiB free, 22.03 GiB in use by this process, 21.73 GiB allocated by PyTorch, 13.87 MiB reserved by PyTorch but unallocated, and the hint to try PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True if reserved-but-unallocated memory is large). Only the drawn parameters and the requested allocation differ:

Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=False) -> Tried to allocate 40.00 MiB
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> Tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> Tried to allocate 40.00 MiB
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=False) -> Tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=False, compiled=True)  -> Tried to allocate 448.00 MiB
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=False) -> Tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=False) -> Tried to allocate 448.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=False) -> Tried to allocate 448.00 MiB
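The requested sizes line up exactly with the input the test builds: a bfloat16 tensor of shape [T, 2 * D] occupies T * 2D * 2 bytes. A minimal sketch of that arithmetic, checked against the sizes reported above (the helper is illustrative, not part of the test suite):

    # Sketch: confirm the "Tried to allocate" sizes are exactly the bf16 input
    # x = torch.randn([T, 2 * D], dtype=torch.bfloat16) from test_silu_mul_quant.
    BYTES_PER_BF16 = 2

    def input_mib(T: int, D: int) -> float:
        """MiB occupied by a [T, 2 * D] bfloat16 tensor."""
        return T * (2 * D) * BYTES_PER_BF16 / (1024 * 1024)

    assert input_mib(2048, 5120) == 40.0     # "Tried to allocate 40.00 MiB"
    assert input_mib(4096, 5120) == 80.0     # 80.00 MiB
    assert input_mib(2048, 7168) == 56.0     # 56.00 MiB
    assert input_mib(4096, 7168) == 112.0    # 112.00 MiB
    assert input_mib(16384, 7168) == 448.0   # 448.00 MiB

With only 30.44 MiB free on a 22.07 GiB device, even the smallest of these requests has to fail; the errors say more about the device being nearly full than about any single shape.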
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.8582977Z 2025-05-07T20:31:44.8583095Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:44.8583401Z 2025-05-07T20:31:44.8583506Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.8583916Z self=, 2025-05-07T20:31:44.8584316Z T=128, 2025-05-07T20:31:44.8584508Z D=5120, 2025-05-07T20:31:44.8584706Z scale_ub=1200.0, 2025-05-07T20:31:44.8584931Z contiguous=False, 2025-05-07T20:31:44.8585154Z compiled=False, 2025-05-07T20:31:44.8585358Z ) 2025-05-07T20:31:44.9600764Z self = 2025-05-07T20:31:44.9601731Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:44.9602018Z 2025-05-07T20:31:44.9602102Z @given( 2025-05-07T20:31:44.9602344Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.9602654Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.9602972Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.9603310Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.9603651Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.9603943Z ) 2025-05-07T20:31:44.9604295Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.9604737Z def test_silu_mul_quant( 2025-05-07T20:31:44.9604982Z self, 2025-05-07T20:31:44.9605185Z T: int, 2025-05-07T20:31:44.9605387Z D: int, 2025-05-07T20:31:44.9605840Z scale_ub: Optional[float], 2025-05-07T20:31:44.9606293Z contiguous: bool, 2025-05-07T20:31:44.9606547Z compiled: bool, 2025-05-07T20:31:44.9606775Z ) -> None: 2025-05-07T20:31:44.9606999Z torch.manual_seed(2025) 2025-05-07T20:31:44.9607399Z 2025-05-07T20:31:44.9607736Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.9608086Z 2025-05-07T20:31:44.9608294Z x_sign = torch.sign(x) 2025-05-07T20:31:44.9608591Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.9608913Z x = x_sign * x_clamp 2025-05-07T20:31:44.9609164Z x0 = x[:, :D] 2025-05-07T20:31:44.9609396Z x1 = x[:, D:] 2025-05-07T20:31:44.9609608Z 2025-05-07T20:31:44.9609804Z if contiguous: 2025-05-07T20:31:44.9610045Z x0 = x0.contiguous() 2025-05-07T20:31:44.9610305Z x1 = x1.contiguous() 2025-05-07T20:31:44.9610555Z 2025-05-07T20:31:44.9610763Z if scale_ub is not None: 2025-05-07T20:31:44.9611039Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.9611403Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.9611719Z ) 2025-05-07T20:31:44.9611921Z else: 2025-05-07T20:31:44.9612139Z scale_ub_tensor = None 2025-05-07T20:31:44.9612397Z 2025-05-07T20:31:44.9612638Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.9612959Z op = silu_mul_quant 2025-05-07T20:31:44.9613223Z if compiled: 2025-05-07T20:31:44.9621141Z op = torch.compile(op) 2025-05-07T20:31:44.9621497Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.9621778Z 2025-05-07T20:31:44.9621985Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.9622156Z 2025-05-07T20:31:44.9622267Z moe/activation_test.py:117: 2025-05-07T20:31:44.9622566Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.9622906Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.9623202Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.9623902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.9624607Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.9625159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.9625844Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.9626699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.9627235Z kernel = self.compile( 2025-05-07T20:31:44.9627779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.9628431Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.9628838Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.9629067Z 2025-05-07T20:31:44.9629281Z self = 2025-05-07T20:31:44.9630362Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.9631721Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb368220>} 2025-05-07T20:31:44.9633067Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.9634087Z context = 2025-05-07T20:31:44.9634371Z 2025-05-07T20:31:44.9634628Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.9635142Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.9635608Z module_map=module_map) 2025-05-07T20:31:44.9635978Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.9636332Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.9636590Z E ^ 2025-05-07T20:31:44.9637060Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.9637506Z 2025-05-07T20:31:44.9637926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.9638431Z 2025-05-07T20:31:44.9638540Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.9638947Z self=, 2025-05-07T20:31:44.9639361Z T=2048, 2025-05-07T20:31:44.9639558Z D=7168, 2025-05-07T20:31:44.9639752Z scale_ub=None, 2025-05-07T20:31:44.9639972Z contiguous=False, 2025-05-07T20:31:44.9640199Z compiled=False, 2025-05-07T20:31:44.9640400Z ) 2025-05-07T20:31:44.9640720Z self = 2025-05-07T20:31:44.9641209Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:44.9641485Z 2025-05-07T20:31:44.9641566Z @given( 2025-05-07T20:31:44.9641795Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.9642107Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.9642415Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.9642740Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.9643067Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.9643353Z ) 2025-05-07T20:31:44.9643700Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.9644137Z def test_silu_mul_quant( 2025-05-07T20:31:44.9644380Z self, 2025-05-07T20:31:44.9644571Z T: int, 2025-05-07T20:31:44.9644773Z D: int, 2025-05-07T20:31:44.9644995Z scale_ub: Optional[float], 2025-05-07T20:31:44.9645259Z contiguous: bool, 2025-05-07T20:31:44.9645503Z compiled: bool, 2025-05-07T20:31:44.9645732Z ) -> None: 2025-05-07T20:31:44.9646038Z torch.manual_seed(2025) 2025-05-07T20:31:44.9646283Z 2025-05-07T20:31:44.9646561Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.9648710Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
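The allocator's hint at the end of that message is applied by setting PYTORCH_CUDA_ALLOC_CONF before CUDA is initialized. Whether it would help here is doubtful: only 13.87 MiB (later 5.24 MiB) is reserved-but-unallocated, which points to genuine exhaustion rather than fragmentation. Still, a sketch of how the setting would be applied, assuming a fresh process:

    # Sketch: the variable is read when the CUDA caching allocator initializes,
    # so it must be set before the process makes its first CUDA allocation.
    import os

    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after the environment variable is in place

    _ = torch.randn(8, device="cuda")  # first allocation; the config now applies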
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.9650554Z 2025-05-07T20:31:44.9650682Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:44.9650895Z 2025-05-07T20:31:44.9651000Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.9651417Z self=, 2025-05-07T20:31:44.9651832Z T=128, 2025-05-07T20:31:44.9652018Z D=7168, 2025-05-07T20:31:44.9652223Z scale_ub=1200.0, 2025-05-07T20:31:44.9652444Z contiguous=True, 2025-05-07T20:31:44.9652662Z compiled=True, 2025-05-07T20:31:44.9652872Z ) 2025-05-07T20:31:44.9952112Z self = 2025-05-07T20:31:44.9952746Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:44.9953322Z 2025-05-07T20:31:44.9953445Z @given( 2025-05-07T20:31:44.9953752Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.9954124Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.9954433Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.9954765Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.9955095Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.9955390Z ) 2025-05-07T20:31:44.9955742Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.9956189Z def test_silu_mul_quant( 2025-05-07T20:31:44.9956438Z self, 2025-05-07T20:31:44.9956639Z T: int, 2025-05-07T20:31:44.9956840Z D: int, 2025-05-07T20:31:44.9957067Z scale_ub: Optional[float], 2025-05-07T20:31:44.9957336Z contiguous: bool, 2025-05-07T20:31:44.9957576Z compiled: bool, 2025-05-07T20:31:44.9957803Z ) -> None: 2025-05-07T20:31:44.9958029Z torch.manual_seed(2025) 2025-05-07T20:31:44.9958269Z 2025-05-07T20:31:44.9958543Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.9958885Z 2025-05-07T20:31:44.9959076Z x_sign = torch.sign(x) 2025-05-07T20:31:44.9959366Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.9959675Z x = x_sign * x_clamp 2025-05-07T20:31:44.9959911Z x0 = x[:, :D] 2025-05-07T20:31:44.9960138Z x1 = x[:, D:] 2025-05-07T20:31:44.9960346Z 2025-05-07T20:31:44.9960537Z if contiguous: 2025-05-07T20:31:44.9960773Z x0 = x0.contiguous() 2025-05-07T20:31:44.9961060Z x1 = x1.contiguous() 2025-05-07T20:31:44.9961318Z 2025-05-07T20:31:44.9961518Z if scale_ub is not None: 2025-05-07T20:31:44.9961793Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.9962134Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.9962448Z ) 2025-05-07T20:31:44.9962639Z else: 2025-05-07T20:31:44.9962856Z scale_ub_tensor = None 2025-05-07T20:31:44.9963109Z 2025-05-07T20:31:44.9963340Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.9963651Z op = silu_mul_quant 2025-05-07T20:31:44.9963910Z if compiled: 2025-05-07T20:31:44.9964153Z op = torch.compile(op) 2025-05-07T20:31:44.9964448Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.9964860Z 2025-05-07T20:31:44.9965050Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.9965218Z 2025-05-07T20:31:44.9965318Z moe/activation_test.py:117: 2025-05-07T20:31:44.9965610Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.9965938Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.9966223Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.9966783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:44.9967348Z return fn(*args, **kwargs) 
2025-05-07T20:31:44.9968118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.9968810Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.9969345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.9970030Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.9970686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.9971214Z kernel = self.compile( 2025-05-07T20:31:44.9971751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.9972401Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.9972880Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.9973116Z 2025-05-07T20:31:44.9973321Z self = 2025-05-07T20:31:44.9974397Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.9975762Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb368860>} 2025-05-07T20:31:44.9977094Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.9978114Z context = 2025-05-07T20:31:44.9978400Z 2025-05-07T20:31:44.9978569Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.9979088Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.9979550Z module_map=module_map) 2025-05-07T20:31:44.9979915Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.9980273Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.9980538Z E ^ 2025-05-07T20:31:44.9981002Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.9981446Z 2025-05-07T20:31:44.9981863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.9982369Z 2025-05-07T20:31:44.9982480Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.9982889Z self=, 2025-05-07T20:31:44.9983286Z T=128, 2025-05-07T20:31:44.9983477Z D=7168, 2025-05-07T20:31:44.9983666Z scale_ub=1200.0, 2025-05-07T20:31:44.9983887Z contiguous=True, 2025-05-07T20:31:44.9984108Z compiled=False, 2025-05-07T20:31:44.9984309Z ) 2025-05-07T20:31:44.9984631Z self = 2025-05-07T20:31:44.9985120Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:44.9985481Z 2025-05-07T20:31:44.9985566Z @given( 2025-05-07T20:31:44.9985792Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.9986104Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.9986411Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.9986762Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.9987083Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.9987379Z ) 2025-05-07T20:31:44.9987725Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.9988159Z def test_silu_mul_quant( 2025-05-07T20:31:44.9988406Z self, 2025-05-07T20:31:44.9988604Z T: int, 2025-05-07T20:31:44.9988801Z D: int, 2025-05-07T20:31:44.9989021Z scale_ub: Optional[float], 2025-05-07T20:31:44.9989295Z contiguous: bool, 2025-05-07T20:31:44.9989533Z compiled: bool, 2025-05-07T20:31:44.9989766Z ) -> None: 2025-05-07T20:31:44.9989980Z torch.manual_seed(2025) 2025-05-07T20:31:44.9990221Z 2025-05-07T20:31:44.9990491Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.9990842Z 2025-05-07T20:31:44.9991071Z x_sign = torch.sign(x) 2025-05-07T20:31:44.9991368Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.9993466Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.9995310Z 2025-05-07T20:31:44.9995428Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:44.9995638Z 2025-05-07T20:31:44.9995745Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.9996151Z self=, 2025-05-07T20:31:44.9996550Z T=128, 2025-05-07T20:31:44.9996746Z D=5120, 2025-05-07T20:31:44.9996941Z scale_ub=1200.0, 2025-05-07T20:31:44.9997161Z contiguous=True, 2025-05-07T20:31:44.9997386Z compiled=True, 2025-05-07T20:31:44.9997600Z ) 2025-05-07T20:31:44.9997913Z self = 2025-05-07T20:31:44.9998401Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:44.9998665Z 2025-05-07T20:31:44.9998754Z @given( 2025-05-07T20:31:44.9998978Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.9999287Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.9999599Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.9999924Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.0000252Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.0000542Z ) 2025-05-07T20:31:45.0000888Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.0001339Z def test_silu_mul_quant( 2025-05-07T20:31:45.0001623Z self, 2025-05-07T20:31:45.0001819Z T: int, 2025-05-07T20:31:45.0002017Z D: int, 2025-05-07T20:31:45.0002235Z scale_ub: Optional[float], 2025-05-07T20:31:45.0002507Z contiguous: bool, 2025-05-07T20:31:45.0002742Z compiled: bool, 2025-05-07T20:31:45.0002965Z ) -> None: 2025-05-07T20:31:45.0003183Z torch.manual_seed(2025) 2025-05-07T20:31:45.0003425Z 2025-05-07T20:31:45.0003695Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.0004037Z 2025-05-07T20:31:45.0004316Z > x_sign = torch.sign(x) 2025-05-07T20:31:45.0006554Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.0008502Z 2025-05-07T20:31:45.0008625Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:45.0008834Z 2025-05-07T20:31:45.0008935Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.0009340Z self=, 2025-05-07T20:31:45.0009734Z T=128, 2025-05-07T20:31:45.0009928Z D=7168, 2025-05-07T20:31:45.0010128Z scale_ub=None, 2025-05-07T20:31:45.0010336Z contiguous=True, 2025-05-07T20:31:45.0010561Z compiled=True, 2025-05-07T20:31:45.0010766Z ) 2025-05-07T20:31:45.4890505Z self = 2025-05-07T20:31:45.4891259Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.4891533Z 2025-05-07T20:31:45.4891618Z @given( 2025-05-07T20:31:45.4891853Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4892333Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4892655Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4892988Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4893314Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4893604Z ) 2025-05-07T20:31:45.4893958Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4894409Z def test_silu_mul_quant( 2025-05-07T20:31:45.4894654Z self, 2025-05-07T20:31:45.4894853Z T: int, 2025-05-07T20:31:45.4895055Z D: int, 2025-05-07T20:31:45.4895275Z scale_ub: Optional[float], 2025-05-07T20:31:45.4895553Z contiguous: bool, 2025-05-07T20:31:45.4895800Z compiled: bool, 2025-05-07T20:31:45.4896023Z ) -> None: 2025-05-07T20:31:45.4896242Z torch.manual_seed(2025) 2025-05-07T20:31:45.4896489Z 2025-05-07T20:31:45.4896766Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4898811Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
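Free memory shrinks as the run proceeds: the earlier examples saw 30.44 MiB free with 21.73 GiB allocated by PyTorch, while these last ones see 8.44 MiB free and 21.77 GiB allocated, and even T=128 examples now fail. Memory is evidently accumulating across Hypothesis examples rather than each example failing in isolation. A hedged sketch of a per-example cleanup one could try; it only returns unreferenced cached blocks, so it will not help if live objects (for instance torch.compile artifacts) are what is pinning the memory:

    import gc

    import torch

    def release_cuda_memory() -> None:
        """Best-effort cleanup between examples (a sketch, not in the test)."""
        gc.collect()              # drop tensors kept alive only by reference cycles
        torch.cuda.empty_cache()  # return the caching allocator's unused blocks

    # e.g. at the top of test_silu_mul_quant, before the large allocation:
    #     release_cuda_memory()
    #     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)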
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.4900671Z 2025-05-07T20:31:45.4900793Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.4901012Z 2025-05-07T20:31:45.4956906Z FAILED 2025-05-07T20:31:45.4957243Z 2025-05-07T20:31:45.4957618Z =================================== FAILURES =================================== 2025-05-07T20:31:45.4958845Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:31:45.4960024Z + Exception Group Traceback (most recent call last): 2025-05-07T20:31:45.4961520Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 57, in testPartExecutor 2025-05-07T20:31:45.4962291Z | yield 2025-05-07T20:31:45.4962883Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 623, in run 2025-05-07T20:31:45.4963598Z | self._callTestMethod(testMethod) 2025-05-07T20:31:45.4964558Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 579, in _callTestMethod 2025-05-07T20:31:45.4965310Z | if method() is not None: 2025-05-07T20:31:45.4965646Z | ^^^^^^^^ 2025-05-07T20:31:45.4966513Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:31:45.4967653Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4968068Z | ^^^^^^^ 2025-05-07T20:31:45.4968831Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:31:45.4969693Z | raise the_error_hypothesis_found 2025-05-07T20:31:45.4970277Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:31:45.4970850Z +-+---------------- 1 ---------------- 2025-05-07T20:31:45.4971260Z | Traceback (most recent call last): 2025-05-07T20:31:45.4972220Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:45.4973277Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4973775Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.4976668Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.4978656Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:45.4979092Z | self=, 2025-05-07T20:31:45.4979502Z | T=128, 2025-05-07T20:31:45.4979706Z | D=7168, 2025-05-07T20:31:45.4979913Z | scale_ub=1200.0, 2025-05-07T20:31:45.4980161Z | contiguous=True, 2025-05-07T20:31:45.4980408Z | compiled=False, 2025-05-07T20:31:45.4980626Z | ) 2025-05-07T20:31:45.4980818Z | 2025-05-07T20:31:45.4981343Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case 2025-05-07T20:31:45.4981941Z +---------------- 2 ---------------- 2025-05-07T20:31:45.4982232Z | Traceback (most recent call last): 2025-05-07T20:31:45.4982941Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:45.4983721Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4984095Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.4986077Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.4988039Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:45.4988477Z | self=, 2025-05-07T20:31:45.4988967Z | T=128, 2025-05-07T20:31:45.4989164Z | D=7168, 2025-05-07T20:31:45.4989375Z | scale_ub=None, 2025-05-07T20:31:45.4989612Z | contiguous=True, 2025-05-07T20:31:45.4989847Z | compiled=True, 2025-05-07T20:31:45.4990070Z | ) 2025-05-07T20:31:45.4990251Z | 2025-05-07T20:31:45.4990788Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:31:45.4991419Z +---------------- 3 ---------------- 2025-05-07T20:31:45.4991705Z | Traceback (most recent call last): 2025-05-07T20:31:45.4992405Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:45.4993649Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4994029Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.4996410Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.4999170Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:45.4999765Z | self=, 2025-05-07T20:31:45.5000320Z | T=128, 2025-05-07T20:31:45.5000588Z | D=5120, 2025-05-07T20:31:45.5000874Z | scale_ub=1200.0, 2025-05-07T20:31:45.5001192Z | contiguous=True, 2025-05-07T20:31:45.5001525Z | compiled=True, 2025-05-07T20:31:45.5001834Z | ) 2025-05-07T20:31:45.5002076Z | 2025-05-07T20:31:45.5002794Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:31:45.5003620Z +---------------- 4 ---------------- 2025-05-07T20:31:45.5004007Z | Traceback (most recent call last): 2025-05-07T20:31:45.5005000Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:31:45.5006189Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.5006592Z | ^^^^^^^^ 2025-05-07T20:31:45.5007467Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:31:45.5025368Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5025888Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.5026995Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:31:45.5028074Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.5028953Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:31:45.5029980Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5030598Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.5031470Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:31:45.5032539Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5033371Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.5034049Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:31:45.5034849Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5035326Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.5035972Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:31:45.5036668Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.5037050Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.5037859Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:31:45.5038727Z | fn() 2025-05-07T20:31:45.5039539Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:31:45.5040498Z | self.fn.run( 2025-05-07T20:31:45.5041100Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:31:45.5042161Z | kernel = self.compile( 2025-05-07T20:31:45.5042557Z | ^^^^^^^^^^^^^ 2025-05-07T20:31:45.5043417Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:31:45.5044993Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5045560Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.5046528Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:45.5047797Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5048472Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.5048976Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5049475Z | def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.5049846Z | ^ 2025-05-07T20:31:45.5050490Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5051300Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:45.5051869Z | # The test always failed when commented parts were varied together. 2025-05-07T20:31:45.5052599Z | self=, 2025-05-07T20:31:45.5053217Z | T=1, # or any other generated value 2025-05-07T20:31:45.5053651Z | D=5120, # or any other generated value 2025-05-07T20:31:45.5054130Z | scale_ub=None, # or any other generated value 2025-05-07T20:31:45.5054638Z | contiguous=True, # or any other generated value 2025-05-07T20:31:45.5055162Z | compiled=True, # or any other generated value 2025-05-07T20:31:45.5055594Z | ) 2025-05-07T20:31:45.5055851Z | 2025-05-07T20:31:45.5056603Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:31:45.5057464Z +------------------------------------ 2025-05-07T20:31:45.5057954Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:31:45.5058441Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5058999Z self=, 2025-05-07T20:31:45.5059688Z T=1, 2025-05-07T20:31:45.5059952Z D=5120, 2025-05-07T20:31:45.5060212Z scale_ub=None, 2025-05-07T20:31:45.5060510Z contiguous=True, 2025-05-07T20:31:45.5060853Z compiled=True, 2025-05-07T20:31:45.5061172Z ) 2025-05-07T20:31:45.5061632Z self = 2025-05-07T20:31:45.5062328Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.5062696Z 2025-05-07T20:31:45.5062809Z @given( 2025-05-07T20:31:45.5063151Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5063600Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5064030Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5064493Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5064956Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5065352Z ) 2025-05-07T20:31:45.5065845Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5066472Z def test_silu_mul_quant( 2025-05-07T20:31:45.5066810Z self, 2025-05-07T20:31:45.5067094Z T: int, 2025-05-07T20:31:45.5067382Z D: int, 2025-05-07T20:31:45.5067696Z scale_ub: Optional[float], 2025-05-07T20:31:45.5068068Z contiguous: 
bool, 2025-05-07T20:31:45.5068406Z compiled: bool, 2025-05-07T20:31:45.5068720Z ) -> None: 2025-05-07T20:31:45.5069017Z torch.manual_seed(2025) 2025-05-07T20:31:45.5069458Z 2025-05-07T20:31:45.5069845Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5070320Z 2025-05-07T20:31:45.5070603Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5071015Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5071423Z x = x_sign * x_clamp 2025-05-07T20:31:45.5071741Z x0 = x[:, :D] 2025-05-07T20:31:45.5072049Z x1 = x[:, D:] 2025-05-07T20:31:45.5072348Z 2025-05-07T20:31:45.5072627Z if contiguous: 2025-05-07T20:31:45.5072960Z x0 = x0.contiguous() 2025-05-07T20:31:45.5073316Z x1 = x1.contiguous() 2025-05-07T20:31:45.5073653Z 2025-05-07T20:31:45.5073933Z if scale_ub is not None: 2025-05-07T20:31:45.5074324Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5074804Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5075273Z ) 2025-05-07T20:31:45.5075576Z else: 2025-05-07T20:31:45.5075907Z scale_ub_tensor = None 2025-05-07T20:31:45.5076264Z 2025-05-07T20:31:45.5076583Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5077001Z op = silu_mul_quant 2025-05-07T20:31:45.5077353Z if compiled: 2025-05-07T20:31:45.5077707Z op = torch.compile(op) 2025-05-07T20:31:45.5078104Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5078493Z 2025-05-07T20:31:45.5078779Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.5079163Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.5079570Z 2025-05-07T20:31:45.5079897Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5080365Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.5080781Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.5081258Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.5081787Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5082244Z 2025-05-07T20:31:45.5082524Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.5082820Z 2025-05-07T20:31:45.5082972Z moe/activation_test.py:126: 2025-05-07T20:31:45.5083426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5083912Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.5084375Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5085591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.5086661Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.5087423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5088500Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5089469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.5090418Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5091440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.5092486Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5093521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.5094380Z return 
self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.5095196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.5095885Z fn() 2025-05-07T20:31:45.5096627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.5097386Z self.fn.run( 2025-05-07T20:31:45.5097985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5098682Z kernel = self.compile( 2025-05-07T20:31:45.5099350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5100202Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5100696Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5100976Z 2025-05-07T20:31:45.5101233Z self = 2025-05-07T20:31:45.5102707Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5104662Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51f3937380>} 2025-05-07T20:31:45.5106791Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5108236Z context = 2025-05-07T20:31:45.5108641Z 2025-05-07T20:31:45.5108882Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5109611Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5110268Z module_map=module_map) 2025-05-07T20:31:45.5110805Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5111320Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.5111692Z E ^ 2025-05-07T20:31:45.5112362Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5112988Z 2025-05-07T20:31:45.5113558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5114244Z 2025-05-07T20:31:45.5114384Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5115158Z self=, 2025-05-07T20:31:45.5115721Z T=2048, 2025-05-07T20:31:45.5115986Z D=5120, 2025-05-07T20:31:45.5116281Z scale_ub=1200.0, 2025-05-07T20:31:45.5116625Z contiguous=True, 2025-05-07T20:31:45.5116944Z compiled=False, 2025-05-07T20:31:45.5117227Z ) 2025-05-07T20:31:45.5117653Z self = 2025-05-07T20:31:45.5118346Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.5118733Z 2025-05-07T20:31:45.5118847Z @given( 2025-05-07T20:31:45.5119189Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5119617Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5120041Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5120850Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5121343Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5121717Z ) 2025-05-07T20:31:45.5122427Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5123039Z def test_silu_mul_quant( 2025-05-07T20:31:45.5123340Z self, 2025-05-07T20:31:45.5123589Z T: int, 2025-05-07T20:31:45.5123840Z D: int, 2025-05-07T20:31:45.5124109Z scale_ub: Optional[float], 2025-05-07T20:31:45.5124456Z contiguous: bool, 2025-05-07T20:31:45.5124766Z compiled: bool, 2025-05-07T20:31:45.5125214Z ) -> None: 2025-05-07T20:31:45.5125534Z torch.manual_seed(2025) 2025-05-07T20:31:45.5125885Z 2025-05-07T20:31:45.5126289Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5126776Z 2025-05-07T20:31:45.5127074Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5127603Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5128054Z x = x_sign * x_clamp 2025-05-07T20:31:45.5128407Z x0 = x[:, :D] 2025-05-07T20:31:45.5128730Z x1 = x[:, D:] 2025-05-07T20:31:45.5129049Z 2025-05-07T20:31:45.5129317Z if contiguous: 2025-05-07T20:31:45.5129652Z x0 = x0.contiguous() 2025-05-07T20:31:45.5130026Z x1 = x1.contiguous() 2025-05-07T20:31:45.5130376Z 2025-05-07T20:31:45.5130641Z if scale_ub is not None: 2025-05-07T20:31:45.5130981Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5131448Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5131833Z ) 2025-05-07T20:31:45.5132073Z else: 2025-05-07T20:31:45.5132333Z scale_ub_tensor = None 2025-05-07T20:31:45.5132647Z 2025-05-07T20:31:45.5132934Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5133316Z op = silu_mul_quant 2025-05-07T20:31:45.5133625Z if compiled: 2025-05-07T20:31:45.5133933Z op = torch.compile(op) 2025-05-07T20:31:45.5134306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5134645Z 2025-05-07T20:31:45.5134896Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.5135095Z 2025-05-07T20:31:45.5135223Z moe/activation_test.py:117: 2025-05-07T20:31:45.5135580Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5135996Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.5136349Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5137280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.5138192Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.5138858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5139701Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5140514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5144114Z kernel = self.compile( 2025-05-07T20:31:45.5144797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5145612Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5146096Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5146384Z 2025-05-07T20:31:45.5146638Z self = 2025-05-07T20:31:45.5148051Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5149947Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51f3756660>} 2025-05-07T20:31:45.5151796Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5153200Z context = 2025-05-07T20:31:45.5153600Z 2025-05-07T20:31:45.5153921Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5154640Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5155273Z module_map=module_map) 2025-05-07T20:31:45.5155770Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5156246Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.5156595Z E ^ 2025-05-07T20:31:45.5157241Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5157883Z 2025-05-07T20:31:45.5158466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5159178Z 2025-05-07T20:31:45.5159336Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5159904Z self=, 2025-05-07T20:31:45.5160466Z T=2048, 2025-05-07T20:31:45.5160741Z D=5120, 2025-05-07T20:31:45.5161008Z scale_ub=1200.0, 2025-05-07T20:31:45.5161323Z contiguous=True, 2025-05-07T20:31:45.5161632Z compiled=True, 2025-05-07T20:31:45.5161923Z ) 2025-05-07T20:31:45.5162364Z self = 2025-05-07T20:31:45.5163036Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:45.5163408Z 2025-05-07T20:31:45.5163523Z @given( 2025-05-07T20:31:45.5163844Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5164270Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5164689Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5165156Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5165599Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5165992Z ) 2025-05-07T20:31:45.5166468Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5167112Z def test_silu_mul_quant( 2025-05-07T20:31:45.5167443Z self, 2025-05-07T20:31:45.5167815Z T: int, 2025-05-07T20:31:45.5168091Z D: int, 2025-05-07T20:31:45.5168400Z scale_ub: Optional[float], 2025-05-07T20:31:45.5168778Z contiguous: bool, 2025-05-07T20:31:45.5169110Z compiled: bool, 2025-05-07T20:31:45.5169426Z ) -> None: 2025-05-07T20:31:45.5169727Z torch.manual_seed(2025) 2025-05-07T20:31:45.5170060Z 2025-05-07T20:31:45.5170541Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5171065Z 2025-05-07T20:31:45.5171328Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5171731Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5172168Z x = x_sign * x_clamp 2025-05-07T20:31:45.5172490Z x0 = x[:, :D] 2025-05-07T20:31:45.5172793Z x1 = x[:, D:] 2025-05-07T20:31:45.5173078Z 2025-05-07T20:31:45.5173328Z if contiguous: 2025-05-07T20:31:45.5173658Z x0 = x0.contiguous() 2025-05-07T20:31:45.5174029Z x1 = x1.contiguous() 2025-05-07T20:31:45.5174368Z 2025-05-07T20:31:45.5174638Z if scale_ub is not None: 2025-05-07T20:31:45.5175011Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5175468Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5175889Z ) 2025-05-07T20:31:45.5176158Z else: 2025-05-07T20:31:45.5176462Z scale_ub_tensor = None 2025-05-07T20:31:45.5176809Z 2025-05-07T20:31:45.5177126Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5177562Z op = silu_mul_quant 2025-05-07T20:31:45.5177903Z if compiled: 2025-05-07T20:31:45.5178253Z op = torch.compile(op) 2025-05-07T20:31:45.5178664Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5179039Z 2025-05-07T20:31:45.5179309Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.5179796Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.5180202Z 2025-05-07T20:31:45.5180542Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5181007Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.5181399Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.5181818Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.5182296Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5182714Z 2025-05-07T20:31:45.5183000Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.5183283Z 2025-05-07T20:31:45.5183429Z moe/activation_test.py:126: 2025-05-07T20:31:45.5183858Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5184329Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.5184793Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5185872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.5186867Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.5187583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5188501Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5189417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.5190405Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5191408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.5192395Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5193373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.5194212Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.5195015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.5195705Z fn() 2025-05-07T20:31:45.5196372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.5197240Z self.fn.run( 2025-05-07T20:31:45.5197846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5198553Z kernel = self.compile( 2025-05-07T20:31:45.5199283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5200149Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5200693Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5200996Z 2025-05-07T20:31:45.5201274Z self = 2025-05-07T20:31:45.5202731Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5204559Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f51f38cbce0>} 2025-05-07T20:31:45.5206817Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5208555Z context = 2025-05-07T20:31:45.5208972Z 2025-05-07T20:31:45.5209214Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5209898Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5210534Z module_map=module_map) 2025-05-07T20:31:45.5211069Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5211542Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.5211928Z E ^ 2025-05-07T20:31:45.5212545Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5213146Z 2025-05-07T20:31:45.5213713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
The remaining Hypothesis examples fail with this same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The pattern is consistent: examples with compiled=False fail inside fn() while Triton compiles _fbgemm_silu_mul_quant (moe/activation.py:80); examples with compiled=True return from the torch.compile'd fn() and fail instead inside ref_fn() while Triton compiles _kernel_quantize_fp8_row (fp8_gemm.py:2370). Their full tracebacks match the two shown above; only the sampled parameters differ:
2025-05-07T20:31:45.5214539Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> fails in fn() (_fbgemm_silu_mul_quant)
2025-05-07T20:31:45.5270794Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() (_kernel_quantize_fp8_row)
2025-05-07T20:31:45.5314582Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> fails in fn() (_fbgemm_silu_mul_quant)
2025-05-07T20:31:45.5346209Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fails in fn() (_fbgemm_silu_mul_quant)
2025-05-07T20:31:45.5377326Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> fails in ref_fn() (_kernel_quantize_fp8_row)
2025-05-07T20:31:45.5425359Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fails in fn() (_fbgemm_silu_mul_quant)
2025-05-07T20:31:45.5445977Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> fails in fn() (_fbgemm_silu_mul_quant)
2025-05-07T20:31:45.5459106Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() (_kernel_quantize_fp8_row)
One more sampled example follows in full after the two sketches below.
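Why every example fails: the Triton backend on this runner's GPU rejects the fp8e4nv (FP8 E4M3) element type and offers only fp8e4b15 and fp8e5. Below is a minimal sketch of a guard that would skip these tests on such hardware instead of erroring, assuming the usual rule that Triton enables fp8e4nv only on NVIDIA compute capability 8.9 or newer (the A10G in a g5.4xlarge runner reports 8.6); the helper and test names here are hypothetical, not part of FBGEMM's API:

import pytest
import torch

def _fp8e4nv_supported() -> bool:
    # Hypothetical helper: Triton's fp8e4nv (E4M3) codegen is assumed to
    # require an NVIDIA GPU with compute capability >= 8.9 (Ada/Hopper).
    # On older parts such as the A10G, capability (8, 6), compilation raises
    # the "type fp8e4nv not supported in this architecture" error seen above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@pytest.mark.skipif(
    not _fp8e4nv_supported(),
    reason="FP8 E4M3 (fp8e4nv) not supported on this GPU architecture",
)
def test_silu_mul_quant_guarded() -> None:
    ...  # body as in test_silu_mul_quant above

With such a marker, a run on a pre-Ada GPU would report the fp8 cases as skipped rather than failing the job.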
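For context on what the two failing kernels compute: silu_mul_quant fuses y = silu(x0) * x1 with row-wise FP8 quantization, and the test's ref_fn performs the same two steps unfused via triton_quantize_fp8_row. The following rough pure-PyTorch sketch of row-wise E4M3 quantization follows the common recipe (largest finite E4M3 value 448.0, optional clamp of the row maximum to scale_ub); it illustrates the idea and is not FBGEMM's exact algorithm:

from typing import Optional, Tuple

import torch

E4M3_MAX = 448.0  # largest finite float8_e4m3fn value

def rowwise_fp8_quant_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per row, pick a scale so the largest |value| lands at E4M3_MAX,
    # optionally clamping the observed row max to scale_ub first
    # (as the scale_ub=1200.0 examples above do).
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max / E4M3_MAX
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale

Dequantization then matches the test's check: y is approximately y_fp8.to(torch.float32) * scale[:, None].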
at 0x7f51cbd077e0>} 2025-05-07T20:31:45.5474069Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5474339Z context = 2025-05-07T20:31:45.5474350Z 2025-05-07T20:31:45.5474518Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5474786Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5474903Z module_map=module_map) 2025-05-07T20:31:45.5475068Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5475174Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.5475266Z E ^ 2025-05-07T20:31:45.5475624Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5475629Z 2025-05-07T20:31:45.5476050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
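For context, ref_fn in the trace above computes y = x0 * sigmoid(x0) * x1 (SiLU-mul) in fp32 and then quantizes per row with triton_quantize_fp8_row. A minimal PyTorch sketch of that row-wise scheme, consistent with the test's dequantization y_fp8.to(torch.float32) * y_scale[:, None], assuming torch.float8_e4m3fn (max 448.0) and treating scale_ub as a cap on the per-row max; this is an illustration, not FBGEMM's implementation:

    import torch

    FP8_DTYPE = torch.float8_e4m3fn
    FP8_MAX = torch.finfo(FP8_DTYPE).max  # 448.0 for E4M3

    def quantize_fp8_row_sketch(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        # One dequantization scale per row, chosen so the row's max |value| maps to FP8_MAX.
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            # Assumed semantics: clamp the per-row max before deriving the scale.
            row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
        y_scale = row_max / FP8_MAX + 1e-12
        y_fp8 = (y.to(torch.float32) / y_scale[:, None]).to(FP8_DTYPE)
        return y_fp8, y_scale

Multiplying y_fp8.to(torch.float32) by y_scale[:, None], as the test does, recovers y up to fp8 rounding error.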
2025-05-07T20:31:45.5481647Z op = torch.compile(op) 2025-05-07T20:31:45.5481862Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5481953Z 2025-05-07T20:31:45.5482049Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.5482174Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.5482255Z 2025-05-07T20:31:45.5482393Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5482502Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.5482603Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.5482735Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.5482887Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5482964Z 2025-05-07T20:31:45.5483065Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.5483070Z 2025-05-07T20:31:45.5483176Z moe/activation_test.py:126: 2025-05-07T20:31:45.5483309Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5483423Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.5483564Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5484125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.5484237Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.5484598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5484829Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5485199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.5485456Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5485859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.5486118Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5486494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.5486667Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.5487009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.5487178Z fn() 2025-05-07T20:31:45.5487635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.5487723Z self.fn.run( 2025-05-07T20:31:45.5488065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5488164Z kernel = self.compile( 2025-05-07T20:31:45.5488552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5488732Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5488862Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5488867Z 2025-05-07T20:31:45.5489078Z self = 2025-05-07T20:31:45.5489855Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:31:45.5490364Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51cbc36c00>} 2025-05-07T20:31:45.5491195Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5491391Z context = 2025-05-07T20:31:45.5491396Z 2025-05-07T20:31:45.5491568Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5491834Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5491944Z module_map=module_map) 2025-05-07T20:31:45.5492118Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5492223Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.5492312Z E ^ 2025-05-07T20:31:45.5492667Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5492672Z 2025-05-07T20:31:45.5493090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5493100Z 2025-05-07T20:31:45.5493211Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5493435Z self=, 2025-05-07T20:31:45.5493518Z T=128, 2025-05-07T20:31:45.5493599Z D=5120, 2025-05-07T20:31:45.5493689Z scale_ub=None, 2025-05-07T20:31:45.5493782Z contiguous=True, 2025-05-07T20:31:45.5493869Z compiled=True, 2025-05-07T20:31:45.5493947Z ) 2025-05-07T20:31:45.5494173Z self = 2025-05-07T20:31:45.5494349Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.5494354Z 2025-05-07T20:31:45.5494434Z @given( 2025-05-07T20:31:45.5494563Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5494667Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5494791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5494916Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5495033Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5495121Z ) 2025-05-07T20:31:45.5495368Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5495465Z def test_silu_mul_quant( 2025-05-07T20:31:45.5495549Z self, 2025-05-07T20:31:45.5495629Z T: int, 2025-05-07T20:31:45.5495709Z D: int, 2025-05-07T20:31:45.5495819Z scale_ub: Optional[float], 2025-05-07T20:31:45.5496024Z contiguous: bool, 2025-05-07T20:31:45.5496111Z compiled: bool, 2025-05-07T20:31:45.5496194Z ) -> None: 2025-05-07T20:31:45.5496292Z torch.manual_seed(2025) 2025-05-07T20:31:45.5496372Z 2025-05-07T20:31:45.5496545Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5496622Z 2025-05-07T20:31:45.5496720Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5496848Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5496944Z x = x_sign * x_clamp 2025-05-07T20:31:45.5497030Z x0 = x[:, :D] 2025-05-07T20:31:45.5497113Z x1 = x[:, D:] 2025-05-07T20:31:45.5497191Z 2025-05-07T20:31:45.5497285Z if contiguous: 2025-05-07T20:31:45.5497379Z x0 = x0.contiguous() 2025-05-07T20:31:45.5497473Z x1 = x1.contiguous() 2025-05-07T20:31:45.5497553Z 2025-05-07T20:31:45.5497646Z if scale_ub is not None: 2025-05-07T20:31:45.5497762Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5497908Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5497986Z ) 2025-05-07T20:31:45.5498069Z else: 2025-05-07T20:31:45.5498166Z scale_ub_tensor = None 2025-05-07T20:31:45.5498243Z 2025-05-07T20:31:45.5498379Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:31:45.5498473Z op = silu_mul_quant 2025-05-07T20:31:45.5498561Z if compiled: 2025-05-07T20:31:45.5498748Z op = torch.compile(op) 2025-05-07T20:31:45.5498858Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5498933Z 2025-05-07T20:31:45.5499032Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.5499156Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.5499232Z 2025-05-07T20:31:45.5499377Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5499482Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.5499596Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.5499723Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.5499866Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5499949Z 2025-05-07T20:31:45.5500057Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.5500061Z 2025-05-07T20:31:45.5500161Z moe/activation_test.py:126: 2025-05-07T20:31:45.5500304Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5500412Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.5500554Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5501115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.5501219Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.5501583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5501814Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5502181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.5502442Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5502845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.5503104Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5503477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.5503646Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.5503992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.5504156Z fn() 2025-05-07T20:31:45.5504559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.5504646Z self.fn.run( 2025-05-07T20:31:45.5504988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5505091Z kernel = self.compile( 2025-05-07T20:31:45.5505478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5505847Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5506042Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5506050Z 2025-05-07T20:31:45.5506320Z self = 2025-05-07T20:31:45.5507385Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5507938Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51cbb0aac0>} 2025-05-07T20:31:45.5508828Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5509028Z context = 2025-05-07T20:31:45.5509032Z 2025-05-07T20:31:45.5509200Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5509472Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5509589Z module_map=module_map) 2025-05-07T20:31:45.5509753Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5509866Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.5509947Z E ^ 2025-05-07T20:31:45.5510309Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5510314Z 2025-05-07T20:31:45.5510737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5510741Z 2025-05-07T20:31:45.5510849Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5511078Z self=, 2025-05-07T20:31:45.5511160Z T=4096, 2025-05-07T20:31:45.5511248Z D=5120, 2025-05-07T20:31:45.5511336Z scale_ub=None, 2025-05-07T20:31:45.5511424Z contiguous=True, 2025-05-07T20:31:45.5511521Z compiled=True, 2025-05-07T20:31:45.5511600Z ) 2025-05-07T20:31:45.5511821Z self = 2025-05-07T20:31:45.5511999Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.5512004Z 2025-05-07T20:31:45.5512088Z @given( 2025-05-07T20:31:45.5512211Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5512315Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5512436Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5512560Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5512676Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5512752Z ) 2025-05-07T20:31:45.5513003Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5513098Z def test_silu_mul_quant( 2025-05-07T20:31:45.5513177Z self, 2025-05-07T20:31:45.5513386Z T: int, 2025-05-07T20:31:45.5513469Z D: int, 2025-05-07T20:31:45.5513568Z scale_ub: Optional[float], 2025-05-07T20:31:45.5513664Z contiguous: bool, 2025-05-07T20:31:45.5513751Z compiled: bool, 2025-05-07T20:31:45.5513833Z ) -> None: 2025-05-07T20:31:45.5513934Z torch.manual_seed(2025) 2025-05-07T20:31:45.5514013Z 2025-05-07T20:31:45.5514186Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5514261Z 2025-05-07T20:31:45.5514362Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5514496Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5514586Z x = x_sign * x_clamp 2025-05-07T20:31:45.5514666Z x0 = x[:, :D] 2025-05-07T20:31:45.5514751Z x1 = x[:, D:] 2025-05-07T20:31:45.5514824Z 2025-05-07T20:31:45.5514910Z if contiguous: 2025-05-07T20:31:45.5515008Z x0 = x0.contiguous() 2025-05-07T20:31:45.5515098Z x1 = x1.contiguous() 2025-05-07T20:31:45.5515181Z 2025-05-07T20:31:45.5515277Z if scale_ub is not None: 2025-05-07T20:31:45.5515384Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5515520Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5515603Z ) 2025-05-07T20:31:45.5515685Z else: 2025-05-07T20:31:45.5515787Z scale_ub_tensor 
= None 2025-05-07T20:31:45.5515860Z 2025-05-07T20:31:45.5515991Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5516168Z op = silu_mul_quant 2025-05-07T20:31:45.5516256Z if compiled: 2025-05-07T20:31:45.5516356Z op = torch.compile(op) 2025-05-07T20:31:45.5516466Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5516543Z 2025-05-07T20:31:45.5516632Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.5516755Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.5516831Z 2025-05-07T20:31:45.5516974Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5517081Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.5517181Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.5517308Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.5517446Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5517523Z 2025-05-07T20:31:45.5517626Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.5517630Z 2025-05-07T20:31:45.5517733Z moe/activation_test.py:126: 2025-05-07T20:31:45.5517862Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5517969Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.5518102Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5518667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.5518774Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.5519132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5519358Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5519724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.5519985Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5520391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.5520646Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5521022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.5521274Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.5521616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.5521701Z fn() 2025-05-07T20:31:45.5522097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.5522183Z self.fn.run( 2025-05-07T20:31:45.5522523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5522617Z kernel = self.compile( 2025-05-07T20:31:45.5522998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5523173Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5523303Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5523312Z 2025-05-07T20:31:45.5523525Z self = 2025-05-07T20:31:45.5524299Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5524973Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c9bea5c0>} 2025-05-07T20:31:45.5525722Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5525916Z context = 2025-05-07T20:31:45.5525921Z 2025-05-07T20:31:45.5526086Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5526357Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5526470Z module_map=module_map) 2025-05-07T20:31:45.5526632Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5526735Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.5526821Z E ^ 2025-05-07T20:31:45.5527184Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5527188Z 2025-05-07T20:31:45.5527728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5527734Z 2025-05-07T20:31:45.5527841Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5528062Z self=, 2025-05-07T20:31:45.5528148Z T=16384, 2025-05-07T20:31:45.5528234Z D=5120, 2025-05-07T20:31:45.5528323Z scale_ub=None, 2025-05-07T20:31:45.5528411Z contiguous=True, 2025-05-07T20:31:45.5528497Z compiled=True, 2025-05-07T20:31:45.5528578Z ) 2025-05-07T20:31:45.5528797Z self = 2025-05-07T20:31:45.5528973Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.5528978Z 2025-05-07T20:31:45.5529059Z @given( 2025-05-07T20:31:45.5529183Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5529285Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5529407Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5529525Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5529644Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5529721Z ) 2025-05-07T20:31:45.5529966Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5530147Z def test_silu_mul_quant( 2025-05-07T20:31:45.5530227Z self, 2025-05-07T20:31:45.5530307Z T: int, 2025-05-07T20:31:45.5530395Z D: int, 2025-05-07T20:31:45.5530495Z scale_ub: Optional[float], 2025-05-07T20:31:45.5530586Z contiguous: bool, 2025-05-07T20:31:45.5530676Z compiled: bool, 2025-05-07T20:31:45.5530756Z ) -> None: 2025-05-07T20:31:45.5530852Z torch.manual_seed(2025) 2025-05-07T20:31:45.5530935Z 2025-05-07T20:31:45.5531113Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5531190Z 2025-05-07T20:31:45.5531284Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5531410Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5531503Z x = x_sign * x_clamp 2025-05-07T20:31:45.5531585Z x0 = x[:, :D] 2025-05-07T20:31:45.5531668Z x1 = x[:, D:] 2025-05-07T20:31:45.5531749Z 2025-05-07T20:31:45.5531835Z if contiguous: 2025-05-07T20:31:45.5531932Z x0 = x0.contiguous() 2025-05-07T20:31:45.5532029Z x1 = x1.contiguous() 2025-05-07T20:31:45.5532105Z 2025-05-07T20:31:45.5532197Z if scale_ub is not None: 2025-05-07T20:31:45.5532307Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5532442Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:31:45.5532525Z ) 2025-05-07T20:31:45.5532608Z else: 2025-05-07T20:31:45.5532704Z scale_ub_tensor = None 2025-05-07T20:31:45.5532864Z 2025-05-07T20:31:45.5532994Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5533089Z op = silu_mul_quant 2025-05-07T20:31:45.5533177Z if compiled: 2025-05-07T20:31:45.5533280Z op = torch.compile(op) 2025-05-07T20:31:45.5533387Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5533467Z 2025-05-07T20:31:45.5533559Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.5533684Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.5533763Z 2025-05-07T20:31:45.5533899Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5534000Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.5534100Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.5534222Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.5534365Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5534440Z 2025-05-07T20:31:45.5534547Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.5534552Z 2025-05-07T20:31:45.5534653Z moe/activation_test.py:126: 2025-05-07T20:31:45.5534783Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5534891Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.5535028Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5535587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.5535696Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.5536053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5536274Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5536650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.5536906Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5537304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.5537557Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5537929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.5538184Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.5538524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.5538604Z fn() 2025-05-07T20:31:45.5539007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.5539099Z self.fn.run( 2025-05-07T20:31:45.5539445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5539543Z kernel = self.compile( 2025-05-07T20:31:45.5539922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5540099Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5540230Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:45.5540235Z 2025-05-07T20:31:45.5540447Z self = 2025-05-07T20:31:45.5541221Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5541799Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51caa054e0>} 2025-05-07T20:31:45.5542555Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5542749Z context = 2025-05-07T20:31:45.5542758Z 2025-05-07T20:31:45.5542927Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5543191Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5543299Z module_map=module_map) 2025-05-07T20:31:45.5543467Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5543571Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.5543652Z E ^ 2025-05-07T20:31:45.5544015Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5544019Z 2025-05-07T20:31:45.5544431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5544436Z 2025-05-07T20:31:45.5544543Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5544764Z self=, 2025-05-07T20:31:45.5544850Z T=1, 2025-05-07T20:31:45.5544937Z D=5120, 2025-05-07T20:31:45.5545021Z scale_ub=1200.0, 2025-05-07T20:31:45.5545110Z contiguous=True, 2025-05-07T20:31:45.5545194Z compiled=True, 2025-05-07T20:31:45.5545272Z ) 2025-05-07T20:31:45.5545502Z self = 2025-05-07T20:31:45.5550916Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:45.5550924Z 2025-05-07T20:31:45.5551027Z @given( 2025-05-07T20:31:45.5551164Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5551272Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5551391Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5551522Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5551640Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5551724Z ) 2025-05-07T20:31:45.5552088Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5552187Z def test_silu_mul_quant( 2025-05-07T20:31:45.5552275Z self, 2025-05-07T20:31:45.5552357Z T: int, 2025-05-07T20:31:45.5552437Z D: int, 2025-05-07T20:31:45.5552545Z scale_ub: Optional[float], 2025-05-07T20:31:45.5552643Z contiguous: bool, 2025-05-07T20:31:45.5552733Z compiled: bool, 2025-05-07T20:31:45.5552822Z ) -> None: 2025-05-07T20:31:45.5552926Z torch.manual_seed(2025) 2025-05-07T20:31:45.5553004Z 2025-05-07T20:31:45.5553189Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5553268Z 2025-05-07T20:31:45.5553371Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5553502Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5553596Z x = x_sign * x_clamp 2025-05-07T20:31:45.5553683Z x0 = x[:, :D] 2025-05-07T20:31:45.5553772Z x1 = x[:, D:] 2025-05-07T20:31:45.5553848Z 2025-05-07T20:31:45.5553939Z if contiguous: 2025-05-07T20:31:45.5554034Z x0 = x0.contiguous() 2025-05-07T20:31:45.5554126Z x1 = x1.contiguous() 2025-05-07T20:31:45.5554208Z 2025-05-07T20:31:45.5554304Z if scale_ub is not None: 2025-05-07T20:31:45.5554416Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:31:45.5554560Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5554640Z ) 2025-05-07T20:31:45.5554807Z else: 2025-05-07T20:31:45.5554909Z scale_ub_tensor = None 2025-05-07T20:31:45.5554986Z 2025-05-07T20:31:45.5555126Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5555220Z op = silu_mul_quant 2025-05-07T20:31:45.5555310Z if compiled: 2025-05-07T20:31:45.5555417Z op = torch.compile(op) 2025-05-07T20:31:45.5555525Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5555607Z 2025-05-07T20:31:45.5555705Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.5555709Z 2025-05-07T20:31:45.5555811Z moe/activation_test.py:117: 2025-05-07T20:31:45.5555947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5556055Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.5556159Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5556549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.5556646Z return fn(*args, **kwargs) 2025-05-07T20:31:45.5557151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.5557255Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.5557619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5557850Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5558199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5558298Z kernel = self.compile( 2025-05-07T20:31:45.5558687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5558869Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5559003Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5559008Z 2025-05-07T20:31:45.5559220Z self = 2025-05-07T20:31:45.5560008Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5560602Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51ca2f09a0>} 2025-05-07T20:31:45.5561358Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5561559Z context = 2025-05-07T20:31:45.5561564Z 2025-05-07T20:31:45.5561732Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5562002Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5562115Z module_map=module_map) 2025-05-07T20:31:45.5562280Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5562382Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.5562470Z E ^ 2025-05-07T20:31:45.5562829Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5562834Z 2025-05-07T20:31:45.5563259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5563264Z 2025-05-07T20:31:45.5563370Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5563699Z self=, 2025-05-07T20:31:45.5563784Z T=1, 2025-05-07T20:31:45.5563866Z D=5120, 2025-05-07T20:31:45.5563954Z scale_ub=None, 2025-05-07T20:31:45.5564052Z contiguous=False, 2025-05-07T20:31:45.5564140Z compiled=True, 2025-05-07T20:31:45.5564220Z ) 2025-05-07T20:31:45.5564443Z self = 2025-05-07T20:31:45.5564612Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.5564624Z 2025-05-07T20:31:45.5564708Z @given( 2025-05-07T20:31:45.5564831Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5564937Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5565059Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5565180Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5565295Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5565380Z ) 2025-05-07T20:31:45.5565638Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5565739Z def test_silu_mul_quant( 2025-05-07T20:31:45.5565820Z self, 2025-05-07T20:31:45.5565900Z T: int, 2025-05-07T20:31:45.5565985Z D: int, 2025-05-07T20:31:45.5566089Z scale_ub: Optional[float], 2025-05-07T20:31:45.5566181Z contiguous: bool, 2025-05-07T20:31:45.5566275Z compiled: bool, 2025-05-07T20:31:45.5566364Z ) -> None: 2025-05-07T20:31:45.5566466Z torch.manual_seed(2025) 2025-05-07T20:31:45.5566545Z 2025-05-07T20:31:45.5566716Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5566793Z 2025-05-07T20:31:45.5566897Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5567025Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5567122Z x = x_sign * x_clamp 2025-05-07T20:31:45.5567205Z x0 = x[:, :D] 2025-05-07T20:31:45.5567294Z x1 = x[:, D:] 2025-05-07T20:31:45.5567378Z 2025-05-07T20:31:45.5567468Z if contiguous: 2025-05-07T20:31:45.5567631Z x0 = x0.contiguous() 2025-05-07T20:31:45.5567727Z x1 = x1.contiguous() 2025-05-07T20:31:45.5567804Z 2025-05-07T20:31:45.5567898Z if scale_ub is not None: 2025-05-07T20:31:45.5568015Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5568154Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5568322Z ) 2025-05-07T20:31:45.5568408Z else: 2025-05-07T20:31:45.5568508Z scale_ub_tensor = None 2025-05-07T20:31:45.5568591Z 2025-05-07T20:31:45.5568726Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5568819Z op = silu_mul_quant 2025-05-07T20:31:45.5568915Z if compiled: 2025-05-07T20:31:45.5569019Z op = torch.compile(op) 2025-05-07T20:31:45.5569128Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5569213Z 2025-05-07T20:31:45.5569313Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.5569438Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.5569516Z 2025-05-07T20:31:45.5569654Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5569759Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.5569867Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.5569994Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.5570149Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5570226Z 2025-05-07T20:31:45.5570332Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:45.5570337Z 2025-05-07T20:31:45.5570446Z moe/activation_test.py:126: 2025-05-07T20:31:45.5570583Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5570694Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.5570916Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.5571533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.5571645Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.5572007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5572233Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5572613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.5572874Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5573274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.5573540Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.5573919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.5574093Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.5574437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.5574518Z fn() 2025-05-07T20:31:45.5574932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.5575026Z self.fn.run( 2025-05-07T20:31:45.5575367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5575464Z kernel = self.compile( 2025-05-07T20:31:45.5575847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5576032Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5576164Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5576168Z 2025-05-07T20:31:45.5576375Z self = 2025-05-07T20:31:45.5577157Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5577742Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f51c97751c0>} 2025-05-07T20:31:45.5578502Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5578696Z context = 2025-05-07T20:31:45.5578700Z 2025-05-07T20:31:45.5578876Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5579142Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5579252Z module_map=module_map) 2025-05-07T20:31:45.5579424Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5579531Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.5579611Z E ^ 2025-05-07T20:31:45.5579972Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5579977Z 2025-05-07T20:31:45.5580393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5580397Z 2025-05-07T20:31:45.5580588Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5580815Z self=, 2025-05-07T20:31:45.5580895Z T=1, 2025-05-07T20:31:45.5580994Z D=5120, 2025-05-07T20:31:45.5581091Z scale_ub=None, 2025-05-07T20:31:45.5581194Z contiguous=True, 2025-05-07T20:31:45.5581295Z compiled=False, 2025-05-07T20:31:45.5581375Z ) 2025-05-07T20:31:45.5581598Z self = 2025-05-07T20:31:45.5581771Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.5581776Z 2025-05-07T20:31:45.5581858Z @given( 2025-05-07T20:31:45.5581986Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5582088Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5582210Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5582331Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5582450Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5582531Z ) 2025-05-07T20:31:45.5582779Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5582875Z def test_silu_mul_quant( 2025-05-07T20:31:45.5582961Z self, 2025-05-07T20:31:45.5583043Z T: int, 2025-05-07T20:31:45.5583122Z D: int, 2025-05-07T20:31:45.5583229Z scale_ub: Optional[float], 2025-05-07T20:31:45.5583326Z contiguous: bool, 2025-05-07T20:31:45.5583416Z compiled: bool, 2025-05-07T20:31:45.5583502Z ) -> None: 2025-05-07T20:31:45.5583598Z torch.manual_seed(2025) 2025-05-07T20:31:45.5583679Z 2025-05-07T20:31:45.5583853Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5583929Z 2025-05-07T20:31:45.5584030Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5584158Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5584255Z x = x_sign * x_clamp 2025-05-07T20:31:45.5584344Z x0 = x[:, :D] 2025-05-07T20:31:45.5584429Z x1 = x[:, D:] 2025-05-07T20:31:45.5584504Z 2025-05-07T20:31:45.5584594Z if contiguous: 2025-05-07T20:31:45.5584688Z x0 = x0.contiguous() 2025-05-07T20:31:45.5584782Z x1 = x1.contiguous() 2025-05-07T20:31:45.5584863Z 2025-05-07T20:31:45.5584959Z if scale_ub is not None: 2025-05-07T20:31:45.5585071Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5585301Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5585381Z ) 2025-05-07T20:31:45.5585463Z else: 2025-05-07T20:31:45.5585564Z scale_ub_tensor = None 2025-05-07T20:31:45.5585639Z 2025-05-07T20:31:45.5585774Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5585868Z op = silu_mul_quant 2025-05-07T20:31:45.5585957Z if compiled: 2025-05-07T20:31:45.5586068Z 
op = torch.compile(op) 2025-05-07T20:31:45.5586179Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5586256Z 2025-05-07T20:31:45.5586352Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.5586357Z 2025-05-07T20:31:45.5586459Z moe/activation_test.py:117: 2025-05-07T20:31:45.5586596Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5586701Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.5586816Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5587325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.5587426Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.5587788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5588016Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5588439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5588542Z kernel = self.compile( 2025-05-07T20:31:45.5588925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5589104Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5589236Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5589246Z 2025-05-07T20:31:45.5589453Z self = 2025-05-07T20:31:45.5590236Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5590744Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c9a50a40>} 2025-05-07T20:31:45.5591523Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5591744Z context = 2025-05-07T20:31:45.5591754Z 2025-05-07T20:31:45.5591922Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5592191Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5592301Z module_map=module_map) 2025-05-07T20:31:45.5592468Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5592575Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.5592656Z E ^ 2025-05-07T20:31:45.5593019Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5593024Z 2025-05-07T20:31:45.5593439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5593444Z 2025-05-07T20:31:45.5593552Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5593783Z self=, 2025-05-07T20:31:45.5593968Z T=128, 2025-05-07T20:31:45.5594051Z D=5120, 2025-05-07T20:31:45.5594142Z scale_ub=None, 2025-05-07T20:31:45.5594234Z contiguous=False, 2025-05-07T20:31:45.5594326Z compiled=True, 2025-05-07T20:31:45.5594405Z ) 2025-05-07T20:31:45.5594626Z self = 2025-05-07T20:31:45.5594803Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.5594808Z 2025-05-07T20:31:45.5594895Z @given( 2025-05-07T20:31:45.5595017Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5595124Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5595245Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5595367Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5595489Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5595568Z ) 2025-05-07T20:31:45.5595819Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5595920Z def test_silu_mul_quant( 2025-05-07T20:31:45.5596000Z self, 2025-05-07T20:31:45.5596088Z T: int, 2025-05-07T20:31:45.5596167Z D: int, 2025-05-07T20:31:45.5596272Z scale_ub: Optional[float], 2025-05-07T20:31:45.5596373Z contiguous: bool, 2025-05-07T20:31:45.5596461Z compiled: bool, 2025-05-07T20:31:45.5596542Z ) -> None: 2025-05-07T20:31:45.5596645Z torch.manual_seed(2025) 2025-05-07T20:31:45.5596802Z 2025-05-07T20:31:45.5596979Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5597062Z 2025-05-07T20:31:45.5597157Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5597289Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5597380Z x = x_sign * x_clamp 2025-05-07T20:31:45.5597463Z x0 = x[:, :D] 2025-05-07T20:31:45.5597551Z x1 = x[:, D:] 2025-05-07T20:31:45.5597633Z 2025-05-07T20:31:45.5597720Z if contiguous: 2025-05-07T20:31:45.5597819Z x0 = x0.contiguous() 2025-05-07T20:31:45.5597911Z x1 = x1.contiguous() 2025-05-07T20:31:45.5597987Z 2025-05-07T20:31:45.5598084Z if scale_ub is not None: 2025-05-07T20:31:45.5598195Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5598331Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5598414Z ) 2025-05-07T20:31:45.5598499Z else: 2025-05-07T20:31:45.5598598Z scale_ub_tensor = None 2025-05-07T20:31:45.5598682Z 2025-05-07T20:31:45.5598814Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5598910Z op = silu_mul_quant 2025-05-07T20:31:45.5598998Z if compiled: 2025-05-07T20:31:45.5599101Z op = torch.compile(op) 2025-05-07T20:31:45.5599215Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5599293Z 2025-05-07T20:31:45.5599390Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.5599395Z 2025-05-07T20:31:45.5599499Z moe/activation_test.py:117: 2025-05-07T20:31:45.5599632Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5599735Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.5599841Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5600211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.5600314Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.5600813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.5600913Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.5601275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5601498Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5601927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5602027Z kernel = self.compile( 2025-05-07T20:31:45.5602409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5602588Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5602723Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5602727Z 2025-05-07T20:31:45.5602934Z self = 2025-05-07T20:31:45.5603714Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5604223Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c9a52c00>} 2025-05-07T20:31:45.5604975Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5605173Z context = 2025-05-07T20:31:45.5605252Z 2025-05-07T20:31:45.5605427Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5605958Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5606114Z module_map=module_map) 2025-05-07T20:31:45.5606286Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5606386Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.5606471Z E ^ 2025-05-07T20:31:45.5606829Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.5606834Z 
2025-05-07T20:31:45.5607247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:45.5607251Z 
2025-05-07T20:31:45.5607358Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:45.5607639Z     self=,
2025-05-07T20:31:45.5607722Z     T=128,
2025-05-07T20:31:45.5607804Z     D=7168,
2025-05-07T20:31:45.5607891Z     scale_ub=1200.0,
2025-05-07T20:31:45.5607979Z     contiguous=False,
2025-05-07T20:31:45.5608066Z     compiled=False,
2025-05-07T20:31:45.5608141Z )
2025-05-07T20:31:45.5608362Z self = 
2025-05-07T20:31:45.5608540Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:31:45.5608549Z 
2025-05-07T20:31:45.5608627Z     @given(
2025-05-07T20:31:45.5608750Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:45.5608850Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:45.5608963Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:45.5609085Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:45.5609197Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:45.5609271Z     )
2025-05-07T20:31:45.5609528Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:45.5609623Z     def test_silu_mul_quant(
2025-05-07T20:31:45.5609708Z         self,
2025-05-07T20:31:45.5609786Z         T: int,
2025-05-07T20:31:45.5609864Z         D: int,
2025-05-07T20:31:45.5609966Z         scale_ub: Optional[float],
2025-05-07T20:31:45.5610055Z         contiguous: bool,
2025-05-07T20:31:45.5610142Z         compiled: bool,
2025-05-07T20:31:45.5610222Z     ) -> None:
2025-05-07T20:31:45.5610481Z         torch.manual_seed(2025)
2025-05-07T20:31:45.5610558Z 
2025-05-07T20:31:45.5610729Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:45.5610805Z 
2025-05-07T20:31:45.5610902Z         x_sign = torch.sign(x)
2025-05-07T20:31:45.5611024Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:45.5611114Z         x = x_sign * x_clamp
2025-05-07T20:31:45.5611202Z         x0 = x[:, :D]
2025-05-07T20:31:45.5611282Z         x1 = x[:, D:]
2025-05-07T20:31:45.5611360Z 
2025-05-07T20:31:45.5611450Z         if contiguous:
2025-05-07T20:31:45.5611544Z             x0 = x0.contiguous()
2025-05-07T20:31:45.5611635Z             x1 = x1.contiguous()
2025-05-07T20:31:45.5611712Z 
2025-05-07T20:31:45.5611804Z         if scale_ub is not None:
2025-05-07T20:31:45.5611913Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:45.5612049Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:45.5612133Z             )
2025-05-07T20:31:45.5612213Z         else:
2025-05-07T20:31:45.5612308Z             scale_ub_tensor = None
2025-05-07T20:31:45.5612383Z 
2025-05-07T20:31:45.5612523Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:45.5612612Z             op = silu_mul_quant
2025-05-07T20:31:45.5612699Z             if compiled:
2025-05-07T20:31:45.5612804Z                 op = torch.compile(op)
2025-05-07T20:31:45.5612911Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:45.5613095Z 
2025-05-07T20:31:45.5613192Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:45.5613196Z 
2025-05-07T20:31:45.5613292Z moe/activation_test.py:117: 
2025-05-07T20:31:45.5613423Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:45.5613525Z moe/activation_test.py:115: in fn
2025-05-07T20:31:45.5613626Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:45.5614135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:45.5614245Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:45.5614602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:45.5614826Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:45.5615166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:45.5615270Z     kernel = self.compile(
2025-05-07T20:31:45.5615652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:45.5615826Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:45.5615956Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:45.5615960Z 
2025-05-07T20:31:45.5616162Z self = 
2025-05-07T20:31:45.5616948Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:45.5617449Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c971f380>}
2025-05-07T20:31:45.5618204Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:45.5618396Z context = 
2025-05-07T20:31:45.5618401Z 
2025-05-07T20:31:45.5618565Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:45.5618918Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:45.5619024Z                            module_map=module_map)
2025-05-07T20:31:45.5619186Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.5619289Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.5619365Z E       ^
2025-05-07T20:31:45.5619722Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.5619732Z 
2025-05-07T20:31:45.5620145Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
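Root-cause note: every failing example in this run trips the same Triton check. fp8e4nv corresponds to the CUDA e4m3 format, which Triton only emits for GPUs of compute capability 8.9 or newer (Ada/Hopper); the linux.g5.4xlarge runner carries an A10G at SM 8.6, where only fp8e4b15 and fp8e5 are available, hence the ValueError. A minimal skip-guard sketch under those assumptions; the helper and class names below are illustrative, not FBGEMM's actual test code:

    # Hypothetical guard: skip fp8e4nv tests on pre-SM89 GPUs such as the A10G.
    import unittest

    import torch


    def _supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (e4m3) codegen needs SM 8.9+ (Ada) or SM 9.0 (Hopper).
        # torch.cuda.get_device_capability() returns (major, minor), e.g. (8, 6) on A10G.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
    class Fp8GuardedTests(unittest.TestCase):
        pass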
2025-05-07T20:31:45.5620149Z 
2025-05-07T20:31:45.5620254Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False); same test body, failing at the same _fbgemm_silu_mul_quant[grid] launch:
2025-05-07T20:31:45.5631899Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.5631998Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.5632079Z E       ^
2025-05-07T20:31:45.5632434Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.5632438Z 
2025-05-07T20:31:45.5632853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
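The @given strategies shown in the first example draw every parameter from a small fixed set, so each "Trying example" above is one point of an 80-combination grid. A plain-Python sketch of that space (assuming an exhaustive product rather than hypothesis's own sampling order):

    # Sketch of the parameter space the st.sampled_from strategies cover.
    from itertools import product

    GRID = list(product(
        [1, 128, 2048, 4096, 16384],   # T
        [5120, 7168],                  # D
        [None, 1200.0],                # scale_ub
        [True, False],                 # contiguous
        [True, False],                 # compiled
    ))
    assert len(GRID) == 80  # 5 * 2 * 2 * 2 * 2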
2025-05-07T20:31:45.5632866Z 
2025-05-07T20:31:45.5632968Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False); same test body and traceback, same error:
2025-05-07T20:31:45.5644606Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.5644704Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.5644787Z E       ^
2025-05-07T20:31:45.5645142Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.5645147Z 
2025-05-07T20:31:45.5645559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:45.5645563Z 
2025-05-07T20:31:45.5645666Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True); same test body, now routed through torch/_dynamo/eval_frame.py:678 before reaching the same _fbgemm_silu_mul_quant[grid] launch:
2025-05-07T20:31:45.5657756Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.5657854Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.5657940Z E       ^
2025-05-07T20:31:45.5658291Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.5658295Z 
2025-05-07T20:31:45.5658707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
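The compiled=True examples differ from the eager ones only by that torch/_dynamo/eval_frame.py:678 frame: torch.compile wraps fn, but the underlying Triton kernel still has to compile, so both variants die at the same ast_to_ttir check. A minimal sketch of that wrapping (illustrative only, not the test code):

    import torch


    def silu(x: torch.Tensor) -> torch.Tensor:
        # Same activation the test exercises: x * sigmoid(x).
        return x * torch.sigmoid(x)


    # The eager call goes straight to the function; the compiled call is
    # routed through torch._dynamo's eval_frame hook first, as in the traces.
    eager_y = silu(torch.randn(4))
    compiled_silu = torch.compile(silu)
    compiled_y = compiled_silu(torch.randn(4))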
2025-05-07T20:31:45.5658712Z 
2025-05-07T20:31:45.5658819Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True); same test body and compiled-path traceback, same error:
2025-05-07T20:31:45.5670984Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.5671086Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.5671167Z E       ^
2025-05-07T20:31:45.5671518Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.5671522Z 
2025-05-07T20:31:45.5671937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:45.5671942Z 
2025-05-07T20:31:45.5672045Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True); same test body, but here fn() succeeds and the failure moves to the reference path:
2025-05-07T20:31:45.5682906Z         y_fp8, y_scale = fn()
2025-05-07T20:31:45.5683029Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:45.5683105Z 
2025-05-07T20:31:45.5683248Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:45.5683352Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:45.5683457Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:45.5683590Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:45.5683732Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:45.5683809Z 
2025-05-07T20:31:45.5683916Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:45.5683921Z 
2025-05-07T20:31:45.5684023Z moe/activation_test.py:126: 
2025-05-07T20:31:45.5684161Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:45.5684275Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:45.5684410Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:45.5684980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:45.5685086Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:45.5685453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:45.5685684Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:45.5686054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:45.5686314Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:45.5687351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:45.5687606Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:45.5687952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:45.5688119Z     fn()
2025-05-07T20:31:45.5688521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:45.5688614Z     self.fn.run(
2025-05-07T20:31:45.5688954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:45.5689052Z     kernel = self.compile(
2025-05-07T20:31:45.5689438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:45.5689616Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:45.5689755Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:45.5693030Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.5693138Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:45.5693218Z E       ^
2025-05-07T20:31:45.5693575Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.5693580Z 
2025-05-07T20:31:45.5694001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:45.5694012Z 
2025-05-07T20:31:45.5694119Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True); same test body and compiled-path traceback, same error:
2025-05-07T20:31:45.5706771Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.5706873Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.5706956Z E       ^
2025-05-07T20:31:45.5707317Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.5707322Z 
2025-05-07T20:31:45.5707739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
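The T=1, D=7168, scale_ub=None, compiled=True example above is the one variant that got past fn(): its failure moved to the reference path, where triton_quantize_fp8_row autotunes _kernel_quantize_fp8_row and hits the identical fp8e4nv check inside the autotuner's benchmark loop. The reference math itself is plain fp32; a sketch of it taken straight from the test body, minus the fp8 cast that cannot compile on this GPU:

    import torch


    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # SiLU(x0) * x1 in fp32, exactly as ref_fn computes before quantizing.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32


    def dequant_rowwise(y_fp8: torch.Tensor, y_scale: torch.Tensor) -> torch.Tensor:
        # Row-wise dequantization as done in the test: one scale per row.
        return y_fp8.to(torch.float32) * y_scale[:, None]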
2025-05-07T20:31:45.5707750Z 
2025-05-07T20:31:45.5707862Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False); same test body and traceback, same error:
2025-05-07T20:31:45.5719826Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.5719938Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.5720021Z E       ^
2025-05-07T20:31:45.5720380Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.5720385Z 
2025-05-07T20:31:45.5720807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:45.5720811Z 
2025-05-07T20:31:45.5720920Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True); same test body and compiled-path traceback, same error:
2025-05-07T20:31:45.5733450Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.5733629Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.5733714Z E       ^
2025-05-07T20:31:45.5734076Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.5734081Z 
2025-05-07T20:31:45.5734500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
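For isolating the failure outside the FBGEMM test suite, a hypothetical, self-contained repro sketch (not FBGEMM code; assumes Triton's tl.float8e4nv alias and torch.float8_e4m3fn are available): any Triton kernel that casts to fp8e4nv should raise the same CompilationError at ast_to_ttir time on an SM 8.6 device.

    # Repro sketch: a trivial fp8e4nv cast kernel, expected to fail on pre-SM89 GPUs.
    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _cast_fp8e4nv(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < N
        x = tl.load(x_ptr + offs, mask=mask)
        # The cast below is what trips the architecture check during ast_to_ttir.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


    x = torch.randn(1024, device="cuda", dtype=torch.float32)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_fp8e4nv[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)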
2025-05-07T20:31:45.5734504Z 
2025-05-07T20:31:45.5734617Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True); same test body and compiled-path traceback, same error:
2025-05-07T20:31:45.5747280Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.5747467Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.5747550Z E       ^
2025-05-07T20:31:45.5747906Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.5747911Z 
2025-05-07T20:31:45.5748326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5747911Z 2025-05-07T20:31:45.5748326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5748330Z 2025-05-07T20:31:45.5748441Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5748664Z self=, 2025-05-07T20:31:45.5748746Z T=1, 2025-05-07T20:31:45.5748825Z D=5120, 2025-05-07T20:31:45.5748915Z scale_ub=None, 2025-05-07T20:31:45.5749004Z contiguous=False, 2025-05-07T20:31:45.5749091Z compiled=False, 2025-05-07T20:31:45.5749173Z ) 2025-05-07T20:31:45.5749390Z self = 2025-05-07T20:31:45.5749564Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.5749569Z 2025-05-07T20:31:45.5749653Z @given( 2025-05-07T20:31:45.5749777Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5749876Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5749997Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5750114Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5750311Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5750391Z ) 2025-05-07T20:31:45.5750637Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5750736Z def test_silu_mul_quant( 2025-05-07T20:31:45.5750816Z self, 2025-05-07T20:31:45.5750895Z T: int, 2025-05-07T20:31:45.5750976Z D: int, 2025-05-07T20:31:45.5751076Z scale_ub: Optional[float], 2025-05-07T20:31:45.5751168Z contiguous: bool, 2025-05-07T20:31:45.5751263Z compiled: bool, 2025-05-07T20:31:45.5751346Z ) -> None: 2025-05-07T20:31:45.5751442Z torch.manual_seed(2025) 2025-05-07T20:31:45.5751524Z 2025-05-07T20:31:45.5751694Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5751772Z 2025-05-07T20:31:45.5751870Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5751995Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5752092Z x = x_sign * x_clamp 2025-05-07T20:31:45.5752173Z x0 = x[:, :D] 2025-05-07T20:31:45.5752254Z x1 = x[:, D:] 2025-05-07T20:31:45.5752332Z 2025-05-07T20:31:45.5752417Z if contiguous: 2025-05-07T20:31:45.5752508Z x0 = x0.contiguous() 2025-05-07T20:31:45.5752602Z x1 = x1.contiguous() 2025-05-07T20:31:45.5752673Z 2025-05-07T20:31:45.5752764Z if scale_ub is not None: 2025-05-07T20:31:45.5752873Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5753014Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5753093Z ) 2025-05-07T20:31:45.5753175Z else: 2025-05-07T20:31:45.5753269Z scale_ub_tensor = None 2025-05-07T20:31:45.5753347Z 2025-05-07T20:31:45.5753476Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5753568Z op = silu_mul_quant 2025-05-07T20:31:45.5753658Z if compiled: 2025-05-07T20:31:45.5753763Z op = torch.compile(op) 2025-05-07T20:31:45.5753871Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5753951Z 2025-05-07T20:31:45.5754042Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.5754046Z 2025-05-07T20:31:45.5754143Z moe/activation_test.py:117: 2025-05-07T20:31:45.5754278Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5754382Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.5754483Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5755089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.5755190Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.5755551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5755774Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5756119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5756220Z kernel = self.compile( 2025-05-07T20:31:45.5756602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5756779Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5756908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5756918Z 2025-05-07T20:31:45.5757124Z self = 2025-05-07T20:31:45.5757900Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5758478Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c8e485e0>} 2025-05-07T20:31:45.5759230Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5759423Z context = 2025-05-07T20:31:45.5759428Z 2025-05-07T20:31:45.5759605Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5759870Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5759978Z module_map=module_map) 2025-05-07T20:31:45.5760146Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5760248Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.5760330Z E ^ 2025-05-07T20:31:45.5760691Z E ValueError("type fp8e4nv not supported in this architecture. 
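This is the root cause of the whole cascade: Triton can only lower the fp8e4nv type (PyTorch's torch.float8_e4m3fn) on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper), and the error text above shows this device exposes only fp8e4b15 and fp8e5, i.e. it predates sm_89. A minimal probe for this precondition might look as follows (a sketch; supports_fp8e4nv is an illustrative name, not part of the test suite):

    # Sketch: probe whether the current GPU can compile fp8e4nv (e4m3)
    # Triton kernels.  Pre-sm_89 parts raise exactly the ValueError seen
    # in the traceback above.
    import torch

    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        # get_device_capability() returns e.g. (8, 6); fp8e4nv needs >= (8, 9).
        return torch.cuda.get_device_capability() >= (8, 9)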
[... the remaining Hypothesis examples all fail at the same point; their repeated test-source listings and Triton compile stacks are identical to the first failure above and are condensed to one line each below ...]

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)  -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)  -> same CompilationError
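Since every drawn example dies in kernel compilation, one way to keep this job green on such GPUs would be to gate the test class on the same capability probe, e.g. with unittest.skipIf (a sketch under the assumption that the suite uses unittest-style test classes; the class and helper names here are hypothetical):

    import unittest

    import torch

    def _cuda_cc():
        # (0, 0) when CUDA is unavailable keeps the comparison below valid.
        return torch.cuda.get_device_capability() if torch.cuda.is_available() else (0, 0)

    @unittest.skipIf(_cuda_cc() < (8, 9), "fp8e4nv (float8_e4m3fn) Triton kernels require sm_89+")
    class SiluMulQuantTest(unittest.TestCase):  # hypothetical class name
        ...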
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)  -> same CompilationError
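Note that the compiled=True draws fail identically to the eager ones: in the raw tracebacks they merely add a torch/_dynamo/eval_frame.py:678 frame before reaching activation.py:80. That is expected, since torch.compile wraps the Python callable while the Triton kernel inside is still JIT-compiled for the current GPU at first call. A sketch of the shared call path (run_op is an illustrative helper, not from the test):

    import torch

    def run_op(op, *args, compiled: bool):
        # torch.compile only wraps the Python callable; the Triton kernel
        # launched inside op is still JIT-compiled for the *current* GPU on
        # first call, so both branches hit the same fp8e4nv lowering error
        # on pre-sm_89 hardware.
        if compiled:
            op = torch.compile(op)
        return op(*args)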
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)  -> same CompilationError
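For orientation, the op under test takes (x0, x1, scale_ub_tensor) and returns (y_fp8, y_scale), which suggests a fused SwiGLU-style activation followed by rowwise FP8 quantization. A plain-PyTorch sketch of the assumed semantics follows; the actual FBGEMM kernel's conventions (scale orientation, clamping, rounding) may differ:

    import torch
    import torch.nn.functional as F

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        # Assumed semantics: silu(x0) * x1 in fp32, then rowwise FP8 e4m3
        # quantization with an optional upper bound on the per-row max.
        y = F.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale.squeeze(-1)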
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)  -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> same CompilationError
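Finally, because Hypothesis draws these examples nondeterministically, pinning one failing draw with @example makes the failure reproducible when debugging on fp8-capable hardware (a sketch; test_silu_mul_quant_repro is a hypothetical standalone reduction of the test above):

    from hypothesis import example, given, settings, strategies as st

    # The pinned @example case is always tried first, so the failure
    # reproduces on every run regardless of Hypothesis' random draws.
    @example(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(deadline=None)
    def test_silu_mul_quant_repro(T, D, scale_ub, contiguous, compiled):
        ...  # same body as the test above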
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5872158Z 2025-05-07T20:31:45.5872574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5872583Z 2025-05-07T20:31:45.5872688Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5872911Z self=, 2025-05-07T20:31:45.5873000Z T=1, 2025-05-07T20:31:45.5873080Z D=7168, 2025-05-07T20:31:45.5873166Z scale_ub=None, 2025-05-07T20:31:45.5873261Z contiguous=False, 2025-05-07T20:31:45.5873348Z compiled=False, 2025-05-07T20:31:45.5873426Z ) 2025-05-07T20:31:45.5873647Z self = 2025-05-07T20:31:45.5873816Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.5873820Z 2025-05-07T20:31:45.5873905Z @given( 2025-05-07T20:31:45.5874029Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5874134Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5874256Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5874374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5874489Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5874569Z ) 2025-05-07T20:31:45.5874820Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5874917Z def test_silu_mul_quant( 2025-05-07T20:31:45.5875006Z self, 2025-05-07T20:31:45.5875085Z T: int, 2025-05-07T20:31:45.5875166Z D: int, 2025-05-07T20:31:45.5875270Z scale_ub: Optional[float], 2025-05-07T20:31:45.5875362Z contiguous: bool, 2025-05-07T20:31:45.5875450Z compiled: bool, 2025-05-07T20:31:45.5875537Z ) -> None: 2025-05-07T20:31:45.5875635Z torch.manual_seed(2025) 2025-05-07T20:31:45.5875800Z 2025-05-07T20:31:45.5875973Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5876048Z 2025-05-07T20:31:45.5876147Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5876273Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5876364Z x = x_sign * x_clamp 2025-05-07T20:31:45.5876450Z x0 = x[:, :D] 2025-05-07T20:31:45.5876534Z x1 = x[:, D:] 2025-05-07T20:31:45.5876610Z 2025-05-07T20:31:45.5876709Z if contiguous: 2025-05-07T20:31:45.5876804Z x0 = x0.contiguous() 2025-05-07T20:31:45.5876896Z x1 = x1.contiguous() 2025-05-07T20:31:45.5876976Z 2025-05-07T20:31:45.5877069Z if scale_ub is not None: 2025-05-07T20:31:45.5877182Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5877319Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5877397Z ) 2025-05-07T20:31:45.5877486Z else: 2025-05-07T20:31:45.5877592Z scale_ub_tensor = None 2025-05-07T20:31:45.5877668Z 2025-05-07T20:31:45.5877803Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5877896Z op = silu_mul_quant 2025-05-07T20:31:45.5877982Z if compiled: 2025-05-07T20:31:45.5878087Z op = torch.compile(op) 2025-05-07T20:31:45.5878194Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5878271Z 2025-05-07T20:31:45.5878364Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.5878449Z 2025-05-07T20:31:45.5878550Z moe/activation_test.py:117: 2025-05-07T20:31:45.5878685Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5878786Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.5878887Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5879392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.5879496Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.5879855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5880083Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5880425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5880525Z kernel = self.compile( 2025-05-07T20:31:45.5880913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5881090Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5881223Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5881228Z 2025-05-07T20:31:45.5881433Z self = 2025-05-07T20:31:45.5882269Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5882775Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bbdf60c0>} 2025-05-07T20:31:45.5883531Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5883727Z context = 2025-05-07T20:31:45.5883732Z 2025-05-07T20:31:45.5883898Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5884169Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5884383Z module_map=module_map) 2025-05-07T20:31:45.5884547Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5884652Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.5884733Z E ^ 2025-05-07T20:31:45.5885091Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5885096Z 2025-05-07T20:31:45.5885517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5885521Z 2025-05-07T20:31:45.5885627Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5885853Z self=, 2025-05-07T20:31:45.5885936Z T=2048, 2025-05-07T20:31:45.5886017Z D=7168, 2025-05-07T20:31:45.5886105Z scale_ub=None, 2025-05-07T20:31:45.5886194Z contiguous=False, 2025-05-07T20:31:45.5886289Z compiled=True, 2025-05-07T20:31:45.5886367Z ) 2025-05-07T20:31:45.5886586Z self = 2025-05-07T20:31:45.5886765Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.5886770Z 2025-05-07T20:31:45.5886851Z @given( 2025-05-07T20:31:45.5886975Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5887081Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5887274Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5887395Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5887582Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5887661Z ) 2025-05-07T20:31:45.5887914Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5888012Z def test_silu_mul_quant( 2025-05-07T20:31:45.5888092Z self, 2025-05-07T20:31:45.5888181Z T: int, 2025-05-07T20:31:45.5888260Z D: int, 2025-05-07T20:31:45.5888361Z scale_ub: Optional[float], 2025-05-07T20:31:45.5888456Z contiguous: bool, 2025-05-07T20:31:45.5888544Z compiled: bool, 2025-05-07T20:31:45.5888625Z ) -> None: 2025-05-07T20:31:45.5888727Z torch.manual_seed(2025) 2025-05-07T20:31:45.5888805Z 2025-05-07T20:31:45.5888976Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5889055Z 2025-05-07T20:31:45.5889154Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5889287Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5889380Z x = x_sign * x_clamp 2025-05-07T20:31:45.5889463Z x0 = x[:, :D] 2025-05-07T20:31:45.5889551Z x1 = x[:, D:] 2025-05-07T20:31:45.5889627Z 2025-05-07T20:31:45.5889714Z if contiguous: 2025-05-07T20:31:45.5889811Z x0 = x0.contiguous() 2025-05-07T20:31:45.5889903Z x1 = x1.contiguous() 2025-05-07T20:31:45.5889983Z 2025-05-07T20:31:45.5890082Z if scale_ub is not None: 2025-05-07T20:31:45.5890190Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5890328Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5890413Z ) 2025-05-07T20:31:45.5890492Z else: 2025-05-07T20:31:45.5890588Z scale_ub_tensor = None 2025-05-07T20:31:45.5890670Z 2025-05-07T20:31:45.5890800Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5890903Z op = silu_mul_quant 2025-05-07T20:31:45.5890991Z if compiled: 2025-05-07T20:31:45.5891092Z op = torch.compile(op) 2025-05-07T20:31:45.5891204Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5891280Z 2025-05-07T20:31:45.5891372Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.5891376Z 2025-05-07T20:31:45.5891483Z moe/activation_test.py:117: 2025-05-07T20:31:45.5891615Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5891832Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.5891957Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5892327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.5892424Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.5892925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.5893025Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.5893389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5893616Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5893964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5894065Z kernel = self.compile( 2025-05-07T20:31:45.5894447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5894626Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5894757Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5894762Z 2025-05-07T20:31:45.5894970Z self = 2025-05-07T20:31:45.5895828Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5896335Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bbdf7560>} 2025-05-07T20:31:45.5897096Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5897287Z context = 2025-05-07T20:31:45.5897292Z 2025-05-07T20:31:45.5897461Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5897729Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5897840Z module_map=module_map) 2025-05-07T20:31:45.5898007Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5898109Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.5898190Z E ^ 2025-05-07T20:31:45.5898548Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5898972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
[log condensed: Hypothesis retried the same test with (T, D, scale_ub, contiguous, compiled) = (4096, 7168, None, False, True), (16384, 5120, 1200.0, False, False), (16384, 5120, 1200.0, True, True), (16384, 5120, None, False, True), (2048, 5120, None, False, True), and (2048, 5120, 1200.0, False, True); each example printed the identical test source, traceback, and CompilationError shown above.]
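Every failure above has the same root cause: the Triton kernel behind silu_mul_quant requests the fp8e4nv (FP8 E4M3) element type, which Triton lowers only on NVIDIA GPUs with compute capability 8.9 or newer, while the A10G on this linux.g5.4xlarge runner is SM 8.6 — hence only 'fp8e4b15' and 'fp8e5' are reported as supported. A minimal sketch of a capability guard that would skip these cases instead of erroring (supports_fp8e4nv is a hypothetical helper, not part of the test file):

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) conversions are lowered by Triton only on
    # NVIDIA parts with compute capability >= 8.9 (Ada, Hopper, and newer).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical placement on the failing test:
#
# @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
# def test_silu_mul_quant(self, ...) -> None: ...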
[log condensed: the identical failure repeated for (T, D, scale_ub, contiguous, compiled) = (4096, 5120, 1200.0, True, True), (128, 5120, 1200.0, False, True), and (16384, 7168, 1200.0, True, True).]
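For reference, a rough sketch of the semantics the test exercises — silu(x0) * x1 followed by FP8 quantization with a returned scale. This is reconstructed from the test body under the assumption of per-row scaling clamped by scale_ub; FBGEMM's Triton kernel (_fbgemm_silu_mul_quant) is the authoritative implementation and may scale differently:

from typing import Optional, Tuple

import torch


def silu_mul_quant_ref(
    x0: torch.Tensor,                         # [T, D], bfloat16
    x1: torch.Tensor,                         # [T, D], bfloat16
    scale_ub: Optional[torch.Tensor] = None,  # optional cap on the row max
) -> Tuple[torch.Tensor, torch.Tensor]:
    FP8_MAX = 448.0  # finite max of torch.float8_e4m3fn
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.float())
    y_scale = row_max / FP8_MAX  # per-row dequantization scale
    y_fp8 = (y / y_scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_fp8, y_scale.squeeze(1)

On hardware without fp8e4nv support, even this eager-mode reference would need a different FP8 flavor (e.g. torch.float8_e5m2, Triton's 'fp8e5').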
[log condensed: two more examples, (16384, 5120, 1200.0, True, False) and (1, 7168, 1200.0, False, False), failed the same way, each ending with:]
2025-05-07T20:31:45.6048079Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.6048183Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.6048261Z E       ^
2025-05-07T20:31:45.6048613Z E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.6048617Z 2025-05-07T20:31:45.6049030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.6049034Z 2025-05-07T20:31:45.6049143Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6049365Z self=, 2025-05-07T20:31:45.6049445Z T=4096, 2025-05-07T20:31:45.6049526Z D=7168, 2025-05-07T20:31:45.6049612Z scale_ub=1200.0, 2025-05-07T20:31:45.6049696Z contiguous=False, 2025-05-07T20:31:45.6049779Z compiled=True, 2025-05-07T20:31:45.6049853Z ) 2025-05-07T20:31:45.6050070Z self = 2025-05-07T20:31:45.6050257Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:45.6050261Z 2025-05-07T20:31:45.6050343Z @given( 2025-05-07T20:31:45.6050468Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6050569Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6050683Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6050802Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6050917Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6050994Z ) 2025-05-07T20:31:45.6051240Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6051331Z def test_silu_mul_quant( 2025-05-07T20:31:45.6051404Z self, 2025-05-07T20:31:45.6051484Z T: int, 2025-05-07T20:31:45.6051562Z D: int, 2025-05-07T20:31:45.6051658Z scale_ub: Optional[float], 2025-05-07T20:31:45.6051753Z contiguous: bool, 2025-05-07T20:31:45.6051924Z compiled: bool, 2025-05-07T20:31:45.6057280Z ) -> None: 2025-05-07T20:31:45.6057397Z torch.manual_seed(2025) 2025-05-07T20:31:45.6057480Z 2025-05-07T20:31:45.6057660Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6057744Z 2025-05-07T20:31:45.6057839Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6057968Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6058073Z x = x_sign * x_clamp 2025-05-07T20:31:45.6058155Z x0 = x[:, :D] 2025-05-07T20:31:45.6058237Z x1 = x[:, D:] 2025-05-07T20:31:45.6058316Z 2025-05-07T20:31:45.6058404Z if contiguous: 2025-05-07T20:31:45.6058497Z x0 = x0.contiguous() 2025-05-07T20:31:45.6058594Z x1 = x1.contiguous() 2025-05-07T20:31:45.6058670Z 2025-05-07T20:31:45.6058763Z if scale_ub is not None: 2025-05-07T20:31:45.6058878Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.6059024Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.6059108Z ) 2025-05-07T20:31:45.6059186Z else: 2025-05-07T20:31:45.6059283Z scale_ub_tensor = None 2025-05-07T20:31:45.6059363Z 2025-05-07T20:31:45.6059497Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.6059589Z op = silu_mul_quant 2025-05-07T20:31:45.6059679Z if compiled: 2025-05-07T20:31:45.6059888Z op = torch.compile(op) 2025-05-07T20:31:45.6060000Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6060082Z 2025-05-07T20:31:45.6060177Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.6060183Z 2025-05-07T20:31:45.6060286Z moe/activation_test.py:117: 2025-05-07T20:31:45.6060426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6060530Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.6060638Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6061028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.6061126Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.6061635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.6061738Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.6062109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.6062339Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.6062682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.6062782Z kernel = self.compile( 2025-05-07T20:31:45.6063169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.6063356Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.6063491Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6063496Z 2025-05-07T20:31:45.6063704Z self = 2025-05-07T20:31:45.6064498Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.6065005Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c852ccc0>} 2025-05-07T20:31:45.6065766Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.6066042Z context = 2025-05-07T20:31:45.6066047Z 2025-05-07T20:31:45.6066216Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.6066487Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.6066600Z module_map=module_map) 2025-05-07T20:31:45.6066771Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.6066880Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.6066961Z E ^ 2025-05-07T20:31:45.6067325Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.6067330Z 2025-05-07T20:31:45.6067747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.6067756Z 2025-05-07T20:31:45.6067863Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6068092Z self=, 2025-05-07T20:31:45.6068173Z T=128, 2025-05-07T20:31:45.6068254Z D=7168, 2025-05-07T20:31:45.6068342Z scale_ub=1200.0, 2025-05-07T20:31:45.6068430Z contiguous=False, 2025-05-07T20:31:45.6068523Z compiled=True, 2025-05-07T20:31:45.6068598Z ) 2025-05-07T20:31:45.6068955Z self = 2025-05-07T20:31:45.6069134Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:45.6069139Z 2025-05-07T20:31:45.6069220Z @given( 2025-05-07T20:31:45.6069344Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6069452Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6069569Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6069689Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6069812Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6069890Z ) 2025-05-07T20:31:45.6070145Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6070242Z def test_silu_mul_quant( 2025-05-07T20:31:45.6070322Z self, 2025-05-07T20:31:45.6070404Z T: int, 2025-05-07T20:31:45.6070482Z D: int, 2025-05-07T20:31:45.6070583Z scale_ub: Optional[float], 2025-05-07T20:31:45.6070683Z contiguous: bool, 2025-05-07T20:31:45.6070768Z compiled: bool, 2025-05-07T20:31:45.6070846Z ) -> None: 2025-05-07T20:31:45.6070942Z torch.manual_seed(2025) 2025-05-07T20:31:45.6071015Z 2025-05-07T20:31:45.6071185Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6071266Z 2025-05-07T20:31:45.6071359Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6071487Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6071580Z x = x_sign * x_clamp 2025-05-07T20:31:45.6071660Z x0 = x[:, :D] 2025-05-07T20:31:45.6071744Z x1 = x[:, D:] 2025-05-07T20:31:45.6071816Z 2025-05-07T20:31:45.6071900Z if contiguous: 2025-05-07T20:31:45.6071996Z x0 = x0.contiguous() 2025-05-07T20:31:45.6072087Z x1 = x1.contiguous() 2025-05-07T20:31:45.6072162Z 2025-05-07T20:31:45.6072258Z if scale_ub is not None: 2025-05-07T20:31:45.6072365Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.6072501Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.6072583Z ) 2025-05-07T20:31:45.6072663Z else: 2025-05-07T20:31:45.6072758Z scale_ub_tensor = None 2025-05-07T20:31:45.6072833Z 2025-05-07T20:31:45.6072963Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.6073061Z op = silu_mul_quant 2025-05-07T20:31:45.6073147Z if compiled: 2025-05-07T20:31:45.6073332Z op = torch.compile(op) 2025-05-07T20:31:45.6073440Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6073513Z 2025-05-07T20:31:45.6073604Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.6073609Z 2025-05-07T20:31:45.6073716Z moe/activation_test.py:117: 2025-05-07T20:31:45.6073847Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6073950Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.6074054Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6074422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.6074517Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.6075012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.6075109Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.6075474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.6075696Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.6076036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.6076135Z kernel = self.compile( 2025-05-07T20:31:45.6076599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.6076779Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.6076908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6076912Z 2025-05-07T20:31:45.6077116Z self = 2025-05-07T20:31:45.6077897Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.6078398Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c852d580>} 2025-05-07T20:31:45.6079159Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.6079348Z context = 2025-05-07T20:31:45.6079353Z 2025-05-07T20:31:45.6079521Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.6079785Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.6079894Z module_map=module_map) 2025-05-07T20:31:45.6080068Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.6080169Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.6080248Z E ^ 2025-05-07T20:31:45.6080604Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.6080609Z 2025-05-07T20:31:45.6081022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.6081031Z 2025-05-07T20:31:45.6081140Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6081360Z self=, 2025-05-07T20:31:45.6081439Z T=2048, 2025-05-07T20:31:45.6081520Z D=7168, 2025-05-07T20:31:45.6081604Z scale_ub=None, 2025-05-07T20:31:45.6081689Z contiguous=True, 2025-05-07T20:31:45.6081778Z compiled=True, 2025-05-07T20:31:45.6081852Z ) 2025-05-07T20:31:45.6082154Z self = 2025-05-07T20:31:45.6082324Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.6082329Z 2025-05-07T20:31:45.6082408Z @given( 2025-05-07T20:31:45.6082529Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6082629Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6082741Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6082869Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6082981Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6083055Z ) 2025-05-07T20:31:45.6083302Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6083396Z def test_silu_mul_quant( 2025-05-07T20:31:45.6083479Z self, 2025-05-07T20:31:45.6083557Z T: int, 2025-05-07T20:31:45.6083633Z D: int, 2025-05-07T20:31:45.6083740Z scale_ub: Optional[float], 2025-05-07T20:31:45.6083834Z contiguous: bool, 2025-05-07T20:31:45.6083920Z compiled: bool, 2025-05-07T20:31:45.6084000Z ) -> None: 2025-05-07T20:31:45.6084095Z torch.manual_seed(2025) 2025-05-07T20:31:45.6084168Z 2025-05-07T20:31:45.6084341Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6084420Z 2025-05-07T20:31:45.6084513Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6084723Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6084816Z x = x_sign * x_clamp 2025-05-07T20:31:45.6084899Z x0 = x[:, :D] 2025-05-07T20:31:45.6084980Z x1 = x[:, D:] 2025-05-07T20:31:45.6085054Z 2025-05-07T20:31:45.6085141Z if contiguous: 2025-05-07T20:31:45.6085233Z x0 = x0.contiguous() 2025-05-07T20:31:45.6085323Z x1 = x1.contiguous() 2025-05-07T20:31:45.6085402Z 2025-05-07T20:31:45.6085495Z if scale_ub is not None: 2025-05-07T20:31:45.6085616Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.6085749Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.6085826Z ) 2025-05-07T20:31:45.6085907Z else: 2025-05-07T20:31:45.6086002Z scale_ub_tensor = None 2025-05-07T20:31:45.6086076Z 2025-05-07T20:31:45.6086212Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.6086303Z op = silu_mul_quant 2025-05-07T20:31:45.6086393Z if compiled: 2025-05-07T20:31:45.6086497Z op = torch.compile(op) 2025-05-07T20:31:45.6086601Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6086677Z 2025-05-07T20:31:45.6086772Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.6086777Z 2025-05-07T20:31:45.6086876Z moe/activation_test.py:117: 2025-05-07T20:31:45.6087009Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6087110Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.6087215Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6087657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.6087753Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.6088245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.6088348Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.6088706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.6088929Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.6089268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.6089362Z kernel = self.compile( 2025-05-07T20:31:45.6089744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.6090026Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.6090158Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6090162Z 2025-05-07T20:31:45.6090367Z self = 2025-05-07T20:31:45.6091144Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.6091643Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51c852e480>} 2025-05-07T20:31:45.6092387Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.6092583Z context = 2025-05-07T20:31:45.6092588Z 2025-05-07T20:31:45.6092750Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.6093011Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.6093195Z module_map=module_map) 2025-05-07T20:31:45.6093360Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.6093462Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.6093540Z E ^ 2025-05-07T20:31:45.6093894Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.6093898Z 2025-05-07T20:31:45.6094312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.6094322Z 2025-05-07T20:31:45.6094425Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6094650Z self=, 2025-05-07T20:31:45.6094732Z T=16384, 2025-05-07T20:31:45.6094811Z D=5120, 2025-05-07T20:31:45.6094900Z scale_ub=None, 2025-05-07T20:31:45.6094987Z contiguous=False, 2025-05-07T20:31:45.6095072Z compiled=False, 2025-05-07T20:31:45.6095153Z ) 2025-05-07T20:31:45.6095374Z self = 2025-05-07T20:31:45.6095551Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.6095556Z 2025-05-07T20:31:45.6095639Z @given( 2025-05-07T20:31:45.6095759Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6095857Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6095976Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6096097Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6096211Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6096288Z ) 2025-05-07T20:31:45.6096532Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6096629Z def test_silu_mul_quant( 2025-05-07T20:31:45.6096707Z self, 2025-05-07T20:31:45.6096785Z T: int, 2025-05-07T20:31:45.6096864Z D: int, 2025-05-07T20:31:45.6096968Z scale_ub: Optional[float], 2025-05-07T20:31:45.6097057Z contiguous: bool, 2025-05-07T20:31:45.6097147Z compiled: bool, 2025-05-07T20:31:45.6097228Z ) -> None: 2025-05-07T20:31:45.6097325Z torch.manual_seed(2025) 2025-05-07T20:31:45.6097400Z 2025-05-07T20:31:45.6097567Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6097653Z 2025-05-07T20:31:45.6097748Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6097964Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6099771Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6099777Z 2025-05-07T20:31:45.6099896Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:45.6099901Z 2025-05-07T20:31:45.6100007Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6100226Z self=, 2025-05-07T20:31:45.6100313Z T=4096, 2025-05-07T20:31:45.6100397Z D=7168, 2025-05-07T20:31:45.6100483Z scale_ub=1200.0, 2025-05-07T20:31:45.6100576Z contiguous=True, 2025-05-07T20:31:45.6100661Z compiled=True, 2025-05-07T20:31:45.6100737Z ) 2025-05-07T20:31:45.6100958Z self = 2025-05-07T20:31:45.6101129Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:45.6101134Z 2025-05-07T20:31:45.6101213Z @given( 2025-05-07T20:31:45.6101412Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6101510Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6101625Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6101746Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6101858Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6101942Z ) 2025-05-07T20:31:45.6102184Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6102287Z def test_silu_mul_quant( 2025-05-07T20:31:45.6102369Z self, 2025-05-07T20:31:45.6102448Z T: int, 2025-05-07T20:31:45.6102525Z D: int, 2025-05-07T20:31:45.6102628Z scale_ub: Optional[float], 2025-05-07T20:31:45.6102717Z contiguous: bool, 2025-05-07T20:31:45.6102803Z compiled: bool, 2025-05-07T20:31:45.6102887Z ) -> None: 2025-05-07T20:31:45.6102981Z torch.manual_seed(2025) 2025-05-07T20:31:45.6103057Z 2025-05-07T20:31:45.6103234Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6103312Z 2025-05-07T20:31:45.6103403Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6103532Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6105317Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6105332Z 2025-05-07T20:31:45.6105448Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:45.6105453Z 2025-05-07T20:31:45.6105558Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6106045Z self=, 2025-05-07T20:31:45.6106142Z T=16384, 2025-05-07T20:31:45.6106221Z D=7168, 2025-05-07T20:31:45.6106306Z scale_ub=None, 2025-05-07T20:31:45.6106395Z contiguous=False, 2025-05-07T20:31:45.6106480Z compiled=False, 2025-05-07T20:31:45.6106559Z ) 2025-05-07T20:31:45.6106775Z self = 2025-05-07T20:31:45.6107102Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.6107108Z 2025-05-07T20:31:45.6107187Z @given( 2025-05-07T20:31:45.6107305Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6107406Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6107518Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6107634Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6107753Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6107829Z ) 2025-05-07T20:31:45.6108074Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6108168Z def test_silu_mul_quant( 2025-05-07T20:31:45.6108248Z self, 2025-05-07T20:31:45.6108330Z T: int, 2025-05-07T20:31:45.6108407Z D: int, 2025-05-07T20:31:45.6108504Z scale_ub: Optional[float], 2025-05-07T20:31:45.6108601Z contiguous: bool, 2025-05-07T20:31:45.6108687Z compiled: bool, 2025-05-07T20:31:45.6108766Z ) -> None: 2025-05-07T20:31:45.6108864Z torch.manual_seed(2025) 2025-05-07T20:31:45.6108937Z 2025-05-07T20:31:45.6109105Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6111003Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6111010Z 2025-05-07T20:31:45.6111131Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6111141Z 2025-05-07T20:31:45.6111248Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6111469Z self=, 2025-05-07T20:31:45.6111554Z T=2048, 2025-05-07T20:31:45.6111629Z D=7168, 2025-05-07T20:31:45.6111712Z scale_ub=1200.0, 2025-05-07T20:31:45.6111801Z contiguous=True, 2025-05-07T20:31:45.6111886Z compiled=True, 2025-05-07T20:31:45.6111962Z ) 2025-05-07T20:31:45.6112186Z self = 2025-05-07T20:31:45.6112357Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:45.6112361Z 2025-05-07T20:31:45.6112441Z @given( 2025-05-07T20:31:45.6112562Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6112658Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6112778Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6112901Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6113016Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6113099Z ) 2025-05-07T20:31:45.6113341Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6113438Z def test_silu_mul_quant( 2025-05-07T20:31:45.6113518Z self, 2025-05-07T20:31:45.6113596Z T: int, 2025-05-07T20:31:45.6113674Z D: int, 2025-05-07T20:31:45.6113779Z scale_ub: Optional[float], 2025-05-07T20:31:45.6113869Z contiguous: bool, 2025-05-07T20:31:45.6113955Z compiled: bool, 2025-05-07T20:31:45.6114043Z ) -> None: 2025-05-07T20:31:45.6114137Z torch.manual_seed(2025) 2025-05-07T20:31:45.6114210Z 2025-05-07T20:31:45.6114381Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6114459Z 2025-05-07T20:31:45.6114556Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6114679Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6116537Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6116543Z 2025-05-07T20:31:45.6116662Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:45.6116666Z 2025-05-07T20:31:45.6116774Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6116994Z self=, 2025-05-07T20:31:45.6117071Z T=2048, 2025-05-07T20:31:45.6117158Z D=7168, 2025-05-07T20:31:45.6117242Z scale_ub=None, 2025-05-07T20:31:45.6117329Z contiguous=True, 2025-05-07T20:31:45.6117417Z compiled=False, 2025-05-07T20:31:45.6117495Z ) 2025-05-07T20:31:45.6117708Z self = 2025-05-07T20:31:45.6117881Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.6117885Z 2025-05-07T20:31:45.6117965Z @given( 2025-05-07T20:31:45.6118187Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6118287Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6118402Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6118523Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6118636Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6118714Z ) 2025-05-07T20:31:45.6118960Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6119061Z def test_silu_mul_quant( 2025-05-07T20:31:45.6119141Z self, 2025-05-07T20:31:45.6119221Z T: int, 2025-05-07T20:31:45.6119300Z D: int, 2025-05-07T20:31:45.6119401Z scale_ub: Optional[float], 2025-05-07T20:31:45.6119493Z contiguous: bool, 2025-05-07T20:31:45.6119579Z compiled: bool, 2025-05-07T20:31:45.6119660Z ) -> None: 2025-05-07T20:31:45.6119756Z torch.manual_seed(2025) 2025-05-07T20:31:45.6119828Z 2025-05-07T20:31:45.6120003Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6120081Z 2025-05-07T20:31:45.6120172Z > x_sign = torch.sign(x) 2025-05-07T20:31:45.6121939Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6121950Z 2025-05-07T20:31:45.6122066Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:45.6122071Z 2025-05-07T20:31:45.6122176Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6122397Z self=, 2025-05-07T20:31:45.6122480Z T=1, 2025-05-07T20:31:45.6122560Z D=7168, 2025-05-07T20:31:45.6122644Z scale_ub=1200.0, 2025-05-07T20:31:45.6122731Z contiguous=True, 2025-05-07T20:31:45.6122815Z compiled=False, 2025-05-07T20:31:45.6122890Z ) 2025-05-07T20:31:45.6123107Z self = 2025-05-07T20:31:45.6123271Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.6123363Z 2025-05-07T20:31:45.6123445Z @given( 2025-05-07T20:31:45.6123564Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6123660Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6123772Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6123890Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6124002Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6124077Z ) 2025-05-07T20:31:45.6124324Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6124420Z def test_silu_mul_quant( 2025-05-07T20:31:45.6124502Z self, 2025-05-07T20:31:45.6124580Z T: int, 2025-05-07T20:31:45.6124657Z D: int, 2025-05-07T20:31:45.6124759Z scale_ub: Optional[float], 2025-05-07T20:31:45.6124850Z contiguous: bool, 2025-05-07T20:31:45.6124934Z compiled: bool, 2025-05-07T20:31:45.6125022Z ) -> None: 2025-05-07T20:31:45.6125117Z torch.manual_seed(2025) 2025-05-07T20:31:45.6125194Z 2025-05-07T20:31:45.6125363Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6125438Z 2025-05-07T20:31:45.6125532Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6125656Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6125748Z x = x_sign * x_clamp 2025-05-07T20:31:45.6125831Z x0 = x[:, :D] 2025-05-07T20:31:45.6125989Z x1 = x[:, D:] 2025-05-07T20:31:45.6126066Z 2025-05-07T20:31:45.6126152Z if contiguous: 2025-05-07T20:31:45.6126245Z x0 = x0.contiguous() 2025-05-07T20:31:45.6126336Z x1 = x1.contiguous() 2025-05-07T20:31:45.6126415Z 2025-05-07T20:31:45.6126506Z if scale_ub is not None: 2025-05-07T20:31:45.6126612Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.6126752Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.6126837Z ) 2025-05-07T20:31:45.6126917Z else: 2025-05-07T20:31:45.6127012Z scale_ub_tensor = None 2025-05-07T20:31:45.6127086Z 2025-05-07T20:31:45.6127222Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.6127313Z op = silu_mul_quant 2025-05-07T20:31:45.6127400Z if compiled: 2025-05-07T20:31:45.6127602Z op = torch.compile(op) 2025-05-07T20:31:45.6127709Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6127790Z 2025-05-07T20:31:45.6127885Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.6127889Z 2025-05-07T20:31:45.6127986Z moe/activation_test.py:117: 2025-05-07T20:31:45.6128115Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6128221Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.6128319Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6128823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.6128927Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.6129285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.6129511Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.6129853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.6129951Z kernel = self.compile( 2025-05-07T20:31:45.6130330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.6130505Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.6130634Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6130639Z 2025-05-07T20:31:45.6130843Z self = 2025-05-07T20:31:45.6131713Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.6132217Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb759d00>} 2025-05-07T20:31:45.6132961Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.6133158Z context = 2025-05-07T20:31:45.6133162Z 2025-05-07T20:31:45.6133324Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.6133594Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.6133708Z module_map=module_map) 2025-05-07T20:31:45.6133870Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.6133971Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.6134051Z E ^ 2025-05-07T20:31:45.6134403Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.6134490Z 2025-05-07T20:31:45.6134907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.6134912Z 2025-05-07T20:31:45.6135015Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6135239Z self=, 2025-05-07T20:31:45.6135321Z T=128, 2025-05-07T20:31:45.6135402Z D=5120, 2025-05-07T20:31:45.6135498Z scale_ub=None, 2025-05-07T20:31:45.6135585Z contiguous=True, 2025-05-07T20:31:45.6135672Z compiled=False, 2025-05-07T20:31:45.6135749Z ) 2025-05-07T20:31:45.6135964Z self = 2025-05-07T20:31:45.6136139Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.6136144Z 2025-05-07T20:31:45.6136222Z @given( 2025-05-07T20:31:45.6136339Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6136445Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6136560Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6136677Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6136790Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6136868Z ) 2025-05-07T20:31:45.6137112Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6137209Z def test_silu_mul_quant( 2025-05-07T20:31:45.6137291Z self, 2025-05-07T20:31:45.6137371Z T: int, 2025-05-07T20:31:45.6137449Z D: int, 2025-05-07T20:31:45.6137549Z scale_ub: Optional[float], 2025-05-07T20:31:45.6137642Z contiguous: bool, 2025-05-07T20:31:45.6137729Z compiled: bool, 2025-05-07T20:31:45.6137809Z ) -> None: 2025-05-07T20:31:45.6137910Z torch.manual_seed(2025) 2025-05-07T20:31:45.6137985Z 2025-05-07T20:31:45.6138156Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6138234Z 2025-05-07T20:31:45.6138327Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6138451Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6138543Z x = x_sign * x_clamp 2025-05-07T20:31:45.6138625Z x0 = x[:, :D] 2025-05-07T20:31:45.6138714Z x1 = x[:, D:] 2025-05-07T20:31:45.6138791Z 2025-05-07T20:31:45.6138877Z if contiguous: 2025-05-07T20:31:45.6138972Z x0 = x0.contiguous() 2025-05-07T20:31:45.6139151Z x1 = x1.contiguous() 2025-05-07T20:31:45.6139225Z 2025-05-07T20:31:45.6139318Z if scale_ub is not None: 2025-05-07T20:31:45.6139422Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.6139556Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.6139637Z ) 2025-05-07T20:31:45.6139715Z else: 2025-05-07T20:31:45.6139808Z scale_ub_tensor = None 2025-05-07T20:31:45.6139888Z 2025-05-07T20:31:45.6140024Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.6140118Z op = silu_mul_quant 2025-05-07T20:31:45.6140205Z if compiled: 2025-05-07T20:31:45.6140306Z op = torch.compile(op) 2025-05-07T20:31:45.6140416Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6140492Z 2025-05-07T20:31:45.6140583Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.6140588Z 2025-05-07T20:31:45.6140696Z moe/activation_test.py:117: 2025-05-07T20:31:45.6140826Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6140929Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.6141034Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6141531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.6141631Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.6142066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.6142289Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.6142632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.6142727Z kernel = self.compile( 2025-05-07T20:31:45.6143105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.6143285Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.6143413Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6143417Z 2025-05-07T20:31:45.6143625Z self = 2025-05-07T20:31:45.6144400Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.6144900Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb75ae80>} 2025-05-07T20:31:45.6145650Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.6145846Z context = 2025-05-07T20:31:45.6145850Z 2025-05-07T20:31:45.6146017Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.6146279Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.6146392Z module_map=module_map) 2025-05-07T20:31:45.6146560Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.6146664Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.6146747Z E ^ 2025-05-07T20:31:45.6147098Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.6147103Z 2025-05-07T20:31:45.6147513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.6147599Z 2025-05-07T20:31:45.6147708Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6147928Z self=, 2025-05-07T20:31:45.6148012Z T=128, 2025-05-07T20:31:45.6148089Z D=7168, 2025-05-07T20:31:45.6148174Z scale_ub=None, 2025-05-07T20:31:45.6148262Z contiguous=True, 2025-05-07T20:31:45.6148349Z compiled=False, 2025-05-07T20:31:45.6148426Z ) 2025-05-07T20:31:45.6148655Z self = 2025-05-07T20:31:45.6148824Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.6148828Z 2025-05-07T20:31:45.6148907Z @given( 2025-05-07T20:31:45.6149030Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6149131Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6149247Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6149368Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6149481Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6149564Z ) 2025-05-07T20:31:45.6149806Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6149901Z def test_silu_mul_quant( 2025-05-07T20:31:45.6149985Z self, 2025-05-07T20:31:45.6150063Z T: int, 2025-05-07T20:31:45.6150141Z D: int, 2025-05-07T20:31:45.6150344Z scale_ub: Optional[float], 2025-05-07T20:31:45.6150436Z contiguous: bool, 2025-05-07T20:31:45.6150524Z compiled: bool, 2025-05-07T20:31:45.6150606Z ) -> None: 2025-05-07T20:31:45.6150704Z torch.manual_seed(2025) 2025-05-07T20:31:45.6150782Z 2025-05-07T20:31:45.6150951Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6151031Z 2025-05-07T20:31:45.6151125Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6151254Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6151344Z x = x_sign * x_clamp 2025-05-07T20:31:45.6151430Z x0 = x[:, :D] 2025-05-07T20:31:45.6151512Z x1 = x[:, D:] 2025-05-07T20:31:45.6151588Z 2025-05-07T20:31:45.6151675Z if contiguous: 2025-05-07T20:31:45.6151767Z x0 = x0.contiguous() 2025-05-07T20:31:45.6151858Z x1 = x1.contiguous() 2025-05-07T20:31:45.6151940Z 2025-05-07T20:31:45.6152032Z if scale_ub is not None: 2025-05-07T20:31:45.6152147Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.6152281Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.6152359Z ) 2025-05-07T20:31:45.6152442Z else: 2025-05-07T20:31:45.6152536Z scale_ub_tensor = None 2025-05-07T20:31:45.6152610Z 2025-05-07T20:31:45.6152742Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.6152834Z op = silu_mul_quant 2025-05-07T20:31:45.6152923Z if compiled: 2025-05-07T20:31:45.6153024Z op = torch.compile(op) 2025-05-07T20:31:45.6153128Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6153204Z 2025-05-07T20:31:45.6153302Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.6153307Z 2025-05-07T20:31:45.6153404Z moe/activation_test.py:117: 2025-05-07T20:31:45.6153538Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6153638Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.6153744Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6154244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.6154342Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.6154700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.6154926Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.6155354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.6155454Z kernel = self.compile( 2025-05-07T20:31:45.6155834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.6156011Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.6156148Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6156153Z 2025-05-07T20:31:45.6156358Z self = 2025-05-07T20:31:45.6157134Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.6157636Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb75bec0>} 2025-05-07T20:31:45.6158380Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.6158649Z context = 2025-05-07T20:31:45.6158654Z 2025-05-07T20:31:45.6158819Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.6159082Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.6159190Z module_map=module_map) 2025-05-07T20:31:45.6159350Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.6159453Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.6159537Z E ^ 2025-05-07T20:31:45.6159891Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.6159901Z 2025-05-07T20:31:45.6160313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.6160318Z 2025-05-07T20:31:45.6160423Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6160653Z self=, 2025-05-07T20:31:45.6160732Z T=2048, 2025-05-07T20:31:45.6160810Z D=7168, 2025-05-07T20:31:45.6160900Z scale_ub=1200.0, 2025-05-07T20:31:45.6160987Z contiguous=True, 2025-05-07T20:31:45.6161072Z compiled=False, 2025-05-07T20:31:45.6161151Z ) 2025-05-07T20:31:45.6161369Z self = 2025-05-07T20:31:45.6161547Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.6161556Z 2025-05-07T20:31:45.6161635Z @given( 2025-05-07T20:31:45.6161754Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6161858Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6161973Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6162088Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6162204Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6162280Z ) 2025-05-07T20:31:45.6162534Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6162629Z def test_silu_mul_quant( 2025-05-07T20:31:45.6162708Z self, 2025-05-07T20:31:45.6162789Z T: int, 2025-05-07T20:31:45.6162867Z D: int, 2025-05-07T20:31:45.6162964Z scale_ub: Optional[float], 2025-05-07T20:31:45.6163056Z contiguous: bool, 2025-05-07T20:31:45.6163142Z compiled: bool, 2025-05-07T20:31:45.6163301Z ) -> None: 2025-05-07T20:31:45.6163404Z torch.manual_seed(2025) 2025-05-07T20:31:45.6163478Z 2025-05-07T20:31:45.6163648Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6165429Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6165436Z 2025-05-07T20:31:45.6165553Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6165560Z 2025-05-07T20:31:45.6165663Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6165890Z self=, 2025-05-07T20:31:45.6165974Z T=1, 2025-05-07T20:31:45.6166052Z D=5120, 2025-05-07T20:31:45.6166134Z scale_ub=1200.0, 2025-05-07T20:31:45.6166225Z contiguous=True, 2025-05-07T20:31:45.6166313Z compiled=False, 2025-05-07T20:31:45.6166391Z ) 2025-05-07T20:31:45.6166610Z self = 2025-05-07T20:31:45.6166851Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.6166857Z 2025-05-07T20:31:45.6166934Z @given( 2025-05-07T20:31:45.6167056Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6167152Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6167268Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6167384Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6167548Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6167634Z ) 2025-05-07T20:31:45.6167878Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6167973Z def test_silu_mul_quant( 2025-05-07T20:31:45.6168054Z self, 2025-05-07T20:31:45.6168133Z T: int, 2025-05-07T20:31:45.6168209Z D: int, 2025-05-07T20:31:45.6168308Z scale_ub: Optional[float], 2025-05-07T20:31:45.6168400Z contiguous: bool, 2025-05-07T20:31:45.6168488Z compiled: bool, 2025-05-07T20:31:45.6168574Z ) -> None: 2025-05-07T20:31:45.6168669Z torch.manual_seed(2025) 2025-05-07T20:31:45.6168747Z 2025-05-07T20:31:45.6168915Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6168993Z 2025-05-07T20:31:45.6169087Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6169212Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6169303Z x = x_sign * x_clamp 2025-05-07T20:31:45.6169391Z x0 = x[:, :D] 2025-05-07T20:31:45.6169472Z x1 = x[:, D:] 2025-05-07T20:31:45.6169548Z 2025-05-07T20:31:45.6169632Z if contiguous: 2025-05-07T20:31:45.6169725Z x0 = x0.contiguous() 2025-05-07T20:31:45.6169815Z x1 = x1.contiguous() 2025-05-07T20:31:45.6169895Z 2025-05-07T20:31:45.6169987Z if scale_ub is not None: 2025-05-07T20:31:45.6170099Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.6170239Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.6170315Z ) 2025-05-07T20:31:45.6170397Z else: 2025-05-07T20:31:45.6170491Z scale_ub_tensor = None 2025-05-07T20:31:45.6170563Z 2025-05-07T20:31:45.6170697Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.6170789Z op = silu_mul_quant 2025-05-07T20:31:45.6170874Z if compiled: 2025-05-07T20:31:45.6170976Z op = torch.compile(op) 2025-05-07T20:31:45.6171173Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6171249Z 2025-05-07T20:31:45.6171344Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.6171349Z 2025-05-07T20:31:45.6171448Z moe/activation_test.py:117: 2025-05-07T20:31:45.6171583Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6171682Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.6171782Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6172287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.6172386Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.6172741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.6172964Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.6173303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.6173407Z kernel = self.compile( 2025-05-07T20:31:45.6173786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.6173962Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.6174094Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6174099Z 2025-05-07T20:31:45.6174379Z self = 2025-05-07T20:31:45.6175155Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.6175652Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb85d4e0>} 2025-05-07T20:31:45.6176403Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.6176598Z context = 2025-05-07T20:31:45.6176603Z 2025-05-07T20:31:45.6176773Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.6177036Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.6177144Z module_map=module_map) 2025-05-07T20:31:45.6177306Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.6177411Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.6177490Z E ^ 2025-05-07T20:31:45.6177850Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.6177859Z 2025-05-07T20:31:45.6178273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.6178278Z 2025-05-07T20:31:45.6178382Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6178606Z self=, 2025-05-07T20:31:45.6178686Z T=2048, 2025-05-07T20:31:45.6178761Z D=5120, 2025-05-07T20:31:45.6178857Z scale_ub=None, 2025-05-07T20:31:45.6178941Z contiguous=True, 2025-05-07T20:31:45.6179030Z compiled=False, 2025-05-07T20:31:45.6184198Z ) 2025-05-07T20:31:45.6184455Z self = 2025-05-07T20:31:45.6184636Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.6184641Z 2025-05-07T20:31:45.6184726Z @given( 2025-05-07T20:31:45.6184987Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6185090Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6185211Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6185331Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6185444Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6185525Z ) 2025-05-07T20:31:45.6185777Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6185883Z def test_silu_mul_quant( 2025-05-07T20:31:45.6185964Z self, 2025-05-07T20:31:45.6186044Z T: int, 2025-05-07T20:31:45.6186128Z D: int, 2025-05-07T20:31:45.6186231Z scale_ub: Optional[float], 2025-05-07T20:31:45.6186323Z contiguous: bool, 2025-05-07T20:31:45.6186420Z compiled: bool, 2025-05-07T20:31:45.6186502Z ) -> None: 2025-05-07T20:31:45.6186599Z torch.manual_seed(2025) 2025-05-07T20:31:45.6186680Z 2025-05-07T20:31:45.6186859Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6186936Z 2025-05-07T20:31:45.6187034Z > x_sign = torch.sign(x) 2025-05-07T20:31:45.6188917Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6188925Z 2025-05-07T20:31:45.6189051Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:45.6189056Z 2025-05-07T20:31:45.6189165Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6189397Z self=, 2025-05-07T20:31:45.6189486Z T=16384, 2025-05-07T20:31:45.6189565Z D=5120, 2025-05-07T20:31:45.6189659Z scale_ub=None, 2025-05-07T20:31:45.6189746Z contiguous=True, 2025-05-07T20:31:45.6189835Z compiled=False, 2025-05-07T20:31:45.6189918Z ) 2025-05-07T20:31:45.6190138Z self = 2025-05-07T20:31:45.6190316Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.6190327Z 2025-05-07T20:31:45.6190413Z @given( 2025-05-07T20:31:45.6190535Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6190640Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6190756Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6190874Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6190993Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6191076Z ) 2025-05-07T20:31:45.6191323Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6191423Z def test_silu_mul_quant( 2025-05-07T20:31:45.6191502Z self, 2025-05-07T20:31:45.6191581Z T: int, 2025-05-07T20:31:45.6191663Z D: int, 2025-05-07T20:31:45.6191763Z scale_ub: Optional[float], 2025-05-07T20:31:45.6191855Z contiguous: bool, 2025-05-07T20:31:45.6191947Z compiled: bool, 2025-05-07T20:31:45.6192029Z ) -> None: 2025-05-07T20:31:45.6192135Z torch.manual_seed(2025) 2025-05-07T20:31:45.6192210Z 2025-05-07T20:31:45.6192380Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6194160Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6194264Z 2025-05-07T20:31:45.6194385Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6194390Z 2025-05-07T20:31:45.6194501Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6194728Z self=, 2025-05-07T20:31:45.6194808Z T=4096, 2025-05-07T20:31:45.6194891Z D=5120, 2025-05-07T20:31:45.6194977Z scale_ub=None, 2025-05-07T20:31:45.6195064Z contiguous=True, 2025-05-07T20:31:45.6195161Z compiled=False, 2025-05-07T20:31:45.6195236Z ) 2025-05-07T20:31:45.6195456Z self = 2025-05-07T20:31:45.6195628Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.6195637Z 2025-05-07T20:31:45.6195715Z @given( 2025-05-07T20:31:45.6195841Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6195944Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6196058Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6196181Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6196294Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6196447Z ) 2025-05-07T20:31:45.6196699Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6196796Z def test_silu_mul_quant( 2025-05-07T20:31:45.6196879Z self, 2025-05-07T20:31:45.6196958Z T: int, 2025-05-07T20:31:45.6197037Z D: int, 2025-05-07T20:31:45.6197141Z scale_ub: Optional[float], 2025-05-07T20:31:45.6197232Z contiguous: bool, 2025-05-07T20:31:45.6197328Z compiled: bool, 2025-05-07T20:31:45.6197415Z ) -> None: 2025-05-07T20:31:45.6197511Z torch.manual_seed(2025) 2025-05-07T20:31:45.6197586Z 2025-05-07T20:31:45.6197759Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6199534Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6199540Z 2025-05-07T20:31:45.6199665Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6199670Z 2025-05-07T20:31:45.6199782Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6200007Z self=, 2025-05-07T20:31:45.6200087Z T=2048, 2025-05-07T20:31:45.6200165Z D=5120, 2025-05-07T20:31:45.6200253Z scale_ub=None, 2025-05-07T20:31:45.6200341Z contiguous=False, 2025-05-07T20:31:45.6200430Z compiled=False, 2025-05-07T20:31:45.6200511Z ) 2025-05-07T20:31:45.6200726Z self = 2025-05-07T20:31:45.6200904Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.6200909Z 2025-05-07T20:31:45.6200991Z @given( 2025-05-07T20:31:45.6201113Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6201221Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6201338Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6201455Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6201661Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6201739Z ) 2025-05-07T20:31:45.6201986Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6202087Z def test_silu_mul_quant( 2025-05-07T20:31:45.6202166Z self, 2025-05-07T20:31:45.6202245Z T: int, 2025-05-07T20:31:45.6202326Z D: int, 2025-05-07T20:31:45.6202428Z scale_ub: Optional[float], 2025-05-07T20:31:45.6202521Z contiguous: bool, 2025-05-07T20:31:45.6202618Z compiled: bool, 2025-05-07T20:31:45.6202699Z ) -> None: 2025-05-07T20:31:45.6202800Z torch.manual_seed(2025) 2025-05-07T20:31:45.6202877Z 2025-05-07T20:31:45.6203047Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6204814Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6204827Z 2025-05-07T20:31:45.6204947Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6204952Z 2025-05-07T20:31:45.6205205Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6205430Z self=, 2025-05-07T20:31:45.6205510Z T=4096, 2025-05-07T20:31:45.6207867Z D=7168, 2025-05-07T20:31:45.6208005Z scale_ub=None, 2025-05-07T20:31:45.6208088Z contiguous=True, 2025-05-07T20:31:45.6208174Z compiled=True, 2025-05-07T20:31:45.6208251Z ) 2025-05-07T20:31:45.6208479Z self = 2025-05-07T20:31:45.6208663Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.6208669Z 2025-05-07T20:31:45.6208747Z @given( 2025-05-07T20:31:45.6208870Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6208967Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6209084Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6209202Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6209317Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6209389Z ) 2025-05-07T20:31:45.6209636Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6209728Z def test_silu_mul_quant( 2025-05-07T20:31:45.6209808Z self, 2025-05-07T20:31:45.6209883Z T: int, 2025-05-07T20:31:45.6209958Z D: int, 2025-05-07T20:31:45.6210057Z scale_ub: Optional[float], 2025-05-07T20:31:45.6210149Z contiguous: bool, 2025-05-07T20:31:45.6210231Z compiled: bool, 2025-05-07T20:31:45.6210313Z ) -> None: 2025-05-07T20:31:45.6210406Z torch.manual_seed(2025) 2025-05-07T20:31:45.6210478Z 2025-05-07T20:31:45.6210645Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6212478Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6212485Z 2025-05-07T20:31:45.6212603Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6212862Z 2025-05-07T20:31:45.6212968Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6213190Z self=, 2025-05-07T20:31:45.6213266Z T=2048, 2025-05-07T20:31:45.6213344Z D=5120, 2025-05-07T20:31:45.6213428Z scale_ub=1200.0, 2025-05-07T20:31:45.6213514Z contiguous=False, 2025-05-07T20:31:45.6213594Z compiled=False, 2025-05-07T20:31:45.6213672Z ) 2025-05-07T20:31:45.6213889Z self = 2025-05-07T20:31:45.6214060Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:45.6214064Z 2025-05-07T20:31:45.6214146Z @given( 2025-05-07T20:31:45.6214261Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6214360Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6214472Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6214592Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6214704Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6214775Z ) 2025-05-07T20:31:45.6215016Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6215111Z def test_silu_mul_quant( 2025-05-07T20:31:45.6215189Z self, 2025-05-07T20:31:45.6215266Z T: int, 2025-05-07T20:31:45.6215347Z D: int, 2025-05-07T20:31:45.6215441Z scale_ub: Optional[float], 2025-05-07T20:31:45.6215643Z contiguous: bool, 2025-05-07T20:31:45.6215730Z compiled: bool, 2025-05-07T20:31:45.6215807Z ) -> None: 2025-05-07T20:31:45.6215901Z torch.manual_seed(2025) 2025-05-07T20:31:45.6215974Z 2025-05-07T20:31:45.6216138Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6217902Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6217916Z 2025-05-07T20:31:45.6218036Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6218043Z 2025-05-07T20:31:45.6218141Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6218358Z self=, 2025-05-07T20:31:45.6218436Z T=4096, 2025-05-07T20:31:45.6218510Z D=7168, 2025-05-07T20:31:45.6218591Z scale_ub=1200.0, 2025-05-07T20:31:45.6218681Z contiguous=True, 2025-05-07T20:31:45.6218763Z compiled=False, 2025-05-07T20:31:45.6218841Z ) 2025-05-07T20:31:45.6219055Z self = 2025-05-07T20:31:45.6219222Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.6219226Z 2025-05-07T20:31:45.6219304Z @given( 2025-05-07T20:31:45.6219423Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6219520Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6219631Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6219750Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6219860Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6219938Z ) 2025-05-07T20:31:45.6220175Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6220266Z def test_silu_mul_quant( 2025-05-07T20:31:45.6220344Z self, 2025-05-07T20:31:45.6220415Z T: int, 2025-05-07T20:31:45.6220490Z D: int, 2025-05-07T20:31:45.6220671Z scale_ub: Optional[float], 2025-05-07T20:31:45.6220758Z contiguous: bool, 2025-05-07T20:31:45.6220842Z compiled: bool, 2025-05-07T20:31:45.6220920Z ) -> None: 2025-05-07T20:31:45.6221012Z torch.manual_seed(2025) 2025-05-07T20:31:45.6221091Z 2025-05-07T20:31:45.6221254Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6223027Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6223041Z 2025-05-07T20:31:45.6223156Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6223160Z 2025-05-07T20:31:45.6223259Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6223482Z self=, 2025-05-07T20:31:45.6223559Z T=16384, 2025-05-07T20:31:45.6223634Z D=7168, 2025-05-07T20:31:45.6223720Z scale_ub=None, 2025-05-07T20:31:45.6223805Z contiguous=False, 2025-05-07T20:31:45.6223886Z compiled=True, 2025-05-07T20:31:45.6224041Z ) 2025-05-07T20:31:45.6224255Z self = 2025-05-07T20:31:45.6224431Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.6224436Z 2025-05-07T20:31:45.6224514Z @given( 2025-05-07T20:31:45.6224627Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6224727Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6224843Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6224956Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6225066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6225140Z ) 2025-05-07T20:31:45.6225383Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6225475Z def test_silu_mul_quant( 2025-05-07T20:31:45.6225553Z self, 2025-05-07T20:31:45.6225631Z T: int, 2025-05-07T20:31:45.6225710Z D: int, 2025-05-07T20:31:45.6225808Z scale_ub: Optional[float], 2025-05-07T20:31:45.6225900Z contiguous: bool, 2025-05-07T20:31:45.6225985Z compiled: bool, 2025-05-07T20:31:45.6226063Z ) -> None: 2025-05-07T20:31:45.6226160Z torch.manual_seed(2025) 2025-05-07T20:31:45.6226232Z 2025-05-07T20:31:45.6226396Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6228165Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
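The requested sizes in these messages track the failing allocation exactly: the test builds a [T, 2*D] bfloat16 tensor, which occupies T * (2*D) * 2 bytes. A quick arithmetic cross-check in plain Python, reproducing the figures reported above:

    # [T, 2*D] bf16 tensor -> T * (2*D) * 2 bytes
    def randn_mib(T: int, D: int) -> float:
        return T * (2 * D) * 2 / 2**20

    assert randn_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"
    assert randn_mib(16384, 5120) == 320.0  # the 320.00 MiB example
    assert randn_mib(4096, 7168) == 112.0   # the 112.00 MiB examples
    assert randn_mib(2048, 5120) == 40.0    # the 40.00 MiB examples

(The flat 20.00 MiB requests in the T=128 examples further down do not match this formula; they appear to be the caching allocator's large-block granularity rather than the tensor size.)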
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6228176Z 2025-05-07T20:31:45.6228293Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6228301Z 2025-05-07T20:31:45.6228404Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6228624Z self=, 2025-05-07T20:31:45.6228705Z T=4096, 2025-05-07T20:31:45.6228782Z D=7168, 2025-05-07T20:31:45.6228865Z scale_ub=None, 2025-05-07T20:31:45.6229039Z contiguous=True, 2025-05-07T20:31:45.6229124Z compiled=False, 2025-05-07T20:31:45.6229198Z ) 2025-05-07T20:31:45.6229419Z self = 2025-05-07T20:31:45.6229587Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.6229592Z 2025-05-07T20:31:45.6229670Z @given( 2025-05-07T20:31:45.6229791Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6229893Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6230016Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6230131Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6230240Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6230318Z ) 2025-05-07T20:31:45.6230560Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6230653Z def test_silu_mul_quant( 2025-05-07T20:31:45.6230736Z self, 2025-05-07T20:31:45.6230820Z T: int, 2025-05-07T20:31:45.6230897Z D: int, 2025-05-07T20:31:45.6230999Z scale_ub: Optional[float], 2025-05-07T20:31:45.6231088Z contiguous: bool, 2025-05-07T20:31:45.6231175Z compiled: bool, 2025-05-07T20:31:45.6231254Z ) -> None: 2025-05-07T20:31:45.6231346Z torch.manual_seed(2025) 2025-05-07T20:31:45.6231424Z 2025-05-07T20:31:45.6231589Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6233433Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6233450Z 2025-05-07T20:31:45.6233568Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6233572Z 2025-05-07T20:31:45.6233674Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6233897Z self=, 2025-05-07T20:31:45.6233977Z T=16384, 2025-05-07T20:31:45.6234054Z D=7168, 2025-05-07T20:31:45.6234144Z scale_ub=None, 2025-05-07T20:31:45.6234233Z contiguous=True, 2025-05-07T20:31:45.6234316Z compiled=False, 2025-05-07T20:31:45.6234391Z ) 2025-05-07T20:31:45.6234604Z self = 2025-05-07T20:31:45.6234780Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.6234785Z 2025-05-07T20:31:45.6234861Z @given( 2025-05-07T20:31:45.6234978Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6235088Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6235200Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6235313Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6235427Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6235500Z ) 2025-05-07T20:31:45.6235744Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6235837Z def test_silu_mul_quant( 2025-05-07T20:31:45.6235920Z self, 2025-05-07T20:31:45.6236006Z T: int, 2025-05-07T20:31:45.6236083Z D: int, 2025-05-07T20:31:45.6236180Z scale_ub: Optional[float], 2025-05-07T20:31:45.6236270Z contiguous: bool, 2025-05-07T20:31:45.6236355Z compiled: bool, 2025-05-07T20:31:45.6236433Z ) -> None: 2025-05-07T20:31:45.6236534Z torch.manual_seed(2025) 2025-05-07T20:31:45.6236606Z 2025-05-07T20:31:45.6236770Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6238630Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6238636Z 2025-05-07T20:31:45.6238751Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6238759Z 2025-05-07T20:31:45.6238860Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6239077Z self=, 2025-05-07T20:31:45.6239156Z T=16384, 2025-05-07T20:31:45.6239237Z D=7168, 2025-05-07T20:31:45.6239324Z scale_ub=1200.0, 2025-05-07T20:31:45.6239410Z contiguous=True, 2025-05-07T20:31:45.6239493Z compiled=False, 2025-05-07T20:31:45.6239566Z ) 2025-05-07T20:31:45.6239780Z self = 2025-05-07T20:31:45.6239951Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.6239956Z 2025-05-07T20:31:45.6240033Z @given( 2025-05-07T20:31:45.6240258Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6240359Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6240474Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6240588Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6240700Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6240780Z ) 2025-05-07T20:31:45.6241023Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6241122Z def test_silu_mul_quant( 2025-05-07T20:31:45.6241201Z self, 2025-05-07T20:31:45.6241278Z T: int, 2025-05-07T20:31:45.6241354Z D: int, 2025-05-07T20:31:45.6241454Z scale_ub: Optional[float], 2025-05-07T20:31:45.6241541Z contiguous: bool, 2025-05-07T20:31:45.6241634Z compiled: bool, 2025-05-07T20:31:45.6241714Z ) -> None: 2025-05-07T20:31:45.6241807Z torch.manual_seed(2025) 2025-05-07T20:31:45.6241885Z 2025-05-07T20:31:45.6242055Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6243823Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6243840Z 2025-05-07T20:31:45.6243958Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6243963Z 2025-05-07T20:31:45.6244064Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6244288Z self=, 2025-05-07T20:31:45.6244367Z T=128, 2025-05-07T20:31:45.6244448Z D=5120, 2025-05-07T20:31:45.6244535Z scale_ub=1200.0, 2025-05-07T20:31:45.6244621Z contiguous=False, 2025-05-07T20:31:45.6244709Z compiled=False, 2025-05-07T20:31:45.6244782Z ) 2025-05-07T20:31:45.6244995Z self = 2025-05-07T20:31:45.6245172Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:45.6245177Z 2025-05-07T20:31:45.6245253Z @given( 2025-05-07T20:31:45.6245451Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6245553Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6245665Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6245788Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6245902Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6245976Z ) 2025-05-07T20:31:45.6246223Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6246323Z def test_silu_mul_quant( 2025-05-07T20:31:45.6246399Z self, 2025-05-07T20:31:45.6246480Z T: int, 2025-05-07T20:31:45.6246558Z D: int, 2025-05-07T20:31:45.6246658Z scale_ub: Optional[float], 2025-05-07T20:31:45.6246750Z contiguous: bool, 2025-05-07T20:31:45.6246835Z compiled: bool, 2025-05-07T20:31:45.6246918Z ) -> None: 2025-05-07T20:31:45.6247013Z torch.manual_seed(2025) 2025-05-07T20:31:45.6247093Z 2025-05-07T20:31:45.6247262Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6247335Z 2025-05-07T20:31:45.6247427Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6247674Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6247769Z x = x_sign * x_clamp 2025-05-07T20:31:45.6247853Z x0 = x[:, :D] 2025-05-07T20:31:45.6247940Z x1 = x[:, D:] 2025-05-07T20:31:45.6248016Z 2025-05-07T20:31:45.6248107Z if contiguous: 2025-05-07T20:31:45.6248291Z x0 = x0.contiguous() 2025-05-07T20:31:45.6248386Z x1 = x1.contiguous() 2025-05-07T20:31:45.6248469Z 2025-05-07T20:31:45.6248564Z if scale_ub is not None: 2025-05-07T20:31:45.6248676Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.6248819Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.6248899Z ) 2025-05-07T20:31:45.6248978Z else: 2025-05-07T20:31:45.6249086Z scale_ub_tensor = None 2025-05-07T20:31:45.6249161Z 2025-05-07T20:31:45.6249296Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.6249394Z op = silu_mul_quant 2025-05-07T20:31:45.6249480Z if compiled: 2025-05-07T20:31:45.6249583Z op = torch.compile(op) 2025-05-07T20:31:45.6249696Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6249770Z 2025-05-07T20:31:45.6249870Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.6249881Z 2025-05-07T20:31:45.6249982Z moe/activation_test.py:117: 2025-05-07T20:31:45.6250116Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6250226Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.6250327Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6250855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.6250976Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.6251354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:31:45.6251582Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.6251924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.6252022Z kernel = self.compile( 2025-05-07T20:31:45.6252421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.6252597Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.6252726Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6252733Z 2025-05-07T20:31:45.6252939Z self = 2025-05-07T20:31:45.6253719Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.6254314Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb368220>} 2025-05-07T20:31:45.6255068Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.6255262Z context = 2025-05-07T20:31:45.6255267Z 2025-05-07T20:31:45.6255430Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.6255696Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.6255815Z module_map=module_map) 2025-05-07T20:31:45.6255981Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.6256088Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.6256168Z E ^ 2025-05-07T20:31:45.6256527Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.6256532Z 2025-05-07T20:31:45.6257027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.6257032Z 2025-05-07T20:31:45.6257137Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6257360Z self=, 2025-05-07T20:31:45.6257451Z T=2048, 2025-05-07T20:31:45.6257530Z D=7168, 2025-05-07T20:31:45.6257618Z scale_ub=None, 2025-05-07T20:31:45.6257708Z contiguous=False, 2025-05-07T20:31:45.6257798Z compiled=False, 2025-05-07T20:31:45.6257876Z ) 2025-05-07T20:31:45.6258093Z self = 2025-05-07T20:31:45.6258270Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.6258274Z 2025-05-07T20:31:45.6258361Z @given( 2025-05-07T20:31:45.6258483Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6258587Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6258711Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6258828Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6258948Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6259029Z ) 2025-05-07T20:31:45.6259276Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6259377Z def test_silu_mul_quant( 2025-05-07T20:31:45.6259458Z self, 2025-05-07T20:31:45.6259540Z T: int, 2025-05-07T20:31:45.6259627Z D: int, 2025-05-07T20:31:45.6259730Z scale_ub: Optional[float], 2025-05-07T20:31:45.6259824Z contiguous: bool, 2025-05-07T20:31:45.6259917Z compiled: bool, 2025-05-07T20:31:45.6259999Z ) -> None: 2025-05-07T20:31:45.6260098Z torch.manual_seed(2025) 2025-05-07T20:31:45.6260182Z 2025-05-07T20:31:45.6260354Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6262134Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6262222Z 2025-05-07T20:31:45.6262345Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6262350Z 2025-05-07T20:31:45.6262459Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6262682Z self=, 2025-05-07T20:31:45.6262761Z T=128, 2025-05-07T20:31:45.6262848Z D=7168, 2025-05-07T20:31:45.6262934Z scale_ub=1200.0, 2025-05-07T20:31:45.6263027Z contiguous=True, 2025-05-07T20:31:45.6263119Z compiled=True, 2025-05-07T20:31:45.6263195Z ) 2025-05-07T20:31:45.6263411Z self = 2025-05-07T20:31:45.6263583Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:45.6263588Z 2025-05-07T20:31:45.6263668Z @given( 2025-05-07T20:31:45.6263791Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6263891Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6264014Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6264134Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6264247Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6264325Z ) 2025-05-07T20:31:45.6264572Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6264669Z def test_silu_mul_quant( 2025-05-07T20:31:45.6264749Z self, 2025-05-07T20:31:45.6264912Z T: int, 2025-05-07T20:31:45.6264992Z D: int, 2025-05-07T20:31:45.6265098Z scale_ub: Optional[float], 2025-05-07T20:31:45.6265189Z contiguous: bool, 2025-05-07T20:31:45.6265283Z compiled: bool, 2025-05-07T20:31:45.6265367Z ) -> None: 2025-05-07T20:31:45.6265465Z torch.manual_seed(2025) 2025-05-07T20:31:45.6265543Z 2025-05-07T20:31:45.6265717Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6265802Z 2025-05-07T20:31:45.6265896Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6266029Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6266121Z x = x_sign * x_clamp 2025-05-07T20:31:45.6266204Z x0 = x[:, :D] 2025-05-07T20:31:45.6266294Z x1 = x[:, D:] 2025-05-07T20:31:45.6266369Z 2025-05-07T20:31:45.6266461Z if contiguous: 2025-05-07T20:31:45.6266557Z x0 = x0.contiguous() 2025-05-07T20:31:45.6266649Z x1 = x1.contiguous() 2025-05-07T20:31:45.6266733Z 2025-05-07T20:31:45.6266828Z if scale_ub is not None: 2025-05-07T20:31:45.6266935Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.6267078Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.6267160Z ) 2025-05-07T20:31:45.6267239Z else: 2025-05-07T20:31:45.6267344Z scale_ub_tensor = None 2025-05-07T20:31:45.6267420Z 2025-05-07T20:31:45.6267553Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.6267654Z op = silu_mul_quant 2025-05-07T20:31:45.6267744Z if compiled: 2025-05-07T20:31:45.6267852Z op = torch.compile(op) 2025-05-07T20:31:45.6267961Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6268037Z 2025-05-07T20:31:45.6268135Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.6268140Z 2025-05-07T20:31:45.6268241Z moe/activation_test.py:117: 2025-05-07T20:31:45.6268378Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6268487Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.6268587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6268958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.6269056Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.6269549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.6269759Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.6270116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:31:45.6270339Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.6270683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.6270787Z kernel = self.compile( 2025-05-07T20:31:45.6271213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.6271399Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.6271531Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6271536Z 2025-05-07T20:31:45.6271749Z self = 2025-05-07T20:31:45.6272529Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.6273034Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f51bb368860>} 2025-05-07T20:31:45.6273860Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.6274054Z context = 2025-05-07T20:31:45.6274059Z 2025-05-07T20:31:45.6274228Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.6274492Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.6274611Z module_map=module_map) 2025-05-07T20:31:45.6274777Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.6274878Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.6274962Z E ^ 2025-05-07T20:31:45.6275316Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.6275321Z 2025-05-07T20:31:45.6275740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.6275748Z 2025-05-07T20:31:45.6275855Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6276077Z self=, 2025-05-07T20:31:45.6276160Z T=128, 2025-05-07T20:31:45.6276240Z D=7168, 2025-05-07T20:31:45.6276325Z scale_ub=1200.0, 2025-05-07T20:31:45.6276423Z contiguous=True, 2025-05-07T20:31:45.6276510Z compiled=False, 2025-05-07T20:31:45.6276586Z ) 2025-05-07T20:31:45.6276808Z self = 2025-05-07T20:31:45.6276984Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.6276989Z 2025-05-07T20:31:45.6277075Z @given( 2025-05-07T20:31:45.6277196Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6277301Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6277423Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6277542Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6277657Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6277736Z ) 2025-05-07T20:31:45.6277981Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6278080Z def test_silu_mul_quant( 2025-05-07T20:31:45.6278245Z self, 2025-05-07T20:31:45.6278327Z T: int, 2025-05-07T20:31:45.6278407Z D: int, 2025-05-07T20:31:45.6278513Z scale_ub: Optional[float], 2025-05-07T20:31:45.6278605Z contiguous: bool, 2025-05-07T20:31:45.6278699Z compiled: bool, 2025-05-07T20:31:45.6278781Z ) -> None: 2025-05-07T20:31:45.6278879Z torch.manual_seed(2025) 2025-05-07T20:31:45.6278957Z 2025-05-07T20:31:45.6279126Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6279210Z 2025-05-07T20:31:45.6279311Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6279437Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6281208Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
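Two things stand out across these retries: the reported free memory shrinks between examples (30.44 MiB in the earlier failures, 8.44 MiB here) while "allocated by PyTorch" holds near 21.7 GiB, and "reserved but unallocated" stays at a few MiB. That points to live tensors accumulated earlier in the session rather than cache fragmentation, so a torch.cuda.empty_cache() between examples would have little to reclaim. A small instrumentation sketch for watching the same two totals the error message reports (assumed to be added around the failing examples; it is not in the original test):

    import torch

    live = torch.cuda.memory_allocated()      # "allocated by PyTorch" in the message
    reserved = torch.cuda.memory_reserved()   # live tensors plus cached segments
    print(f"{live / 2**30:.2f} GiB live, "
          f"{(reserved - live) / 2**20:.2f} MiB reserved but unallocated")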
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6281222Z 2025-05-07T20:31:45.6281363Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:45.6281368Z 2025-05-07T20:31:45.6281493Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6281798Z self=, 2025-05-07T20:31:45.6281884Z T=128, 2025-05-07T20:31:45.6281967Z D=5120, 2025-05-07T20:31:45.6282055Z scale_ub=1200.0, 2025-05-07T20:31:45.6282147Z contiguous=True, 2025-05-07T20:31:45.6282238Z compiled=True, 2025-05-07T20:31:45.6282317Z ) 2025-05-07T20:31:45.6282533Z self = 2025-05-07T20:31:45.6282703Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:45.6282713Z 2025-05-07T20:31:45.6282794Z @given( 2025-05-07T20:31:45.6282915Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6283021Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6283138Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6283258Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6283372Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6283453Z ) 2025-05-07T20:31:45.6283709Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6283809Z def test_silu_mul_quant( 2025-05-07T20:31:45.6283890Z self, 2025-05-07T20:31:45.6283973Z T: int, 2025-05-07T20:31:45.6284054Z D: int, 2025-05-07T20:31:45.6284156Z scale_ub: Optional[float], 2025-05-07T20:31:45.6284254Z contiguous: bool, 2025-05-07T20:31:45.6284343Z compiled: bool, 2025-05-07T20:31:45.6284430Z ) -> None: 2025-05-07T20:31:45.6284533Z torch.manual_seed(2025) 2025-05-07T20:31:45.6284611Z 2025-05-07T20:31:45.6284786Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6284865Z 2025-05-07T20:31:45.6284962Z > x_sign = torch.sign(x) 2025-05-07T20:31:45.6286730Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6286736Z 2025-05-07T20:31:45.6286855Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:45.6286941Z 2025-05-07T20:31:45.6287051Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6287273Z self=, 2025-05-07T20:31:45.6287355Z T=128, 2025-05-07T20:31:45.6287440Z D=7168, 2025-05-07T20:31:45.6287597Z scale_ub=None, 2025-05-07T20:31:45.6287688Z contiguous=True, 2025-05-07T20:31:45.6287777Z compiled=True, 2025-05-07T20:31:45.6287851Z ) 2025-05-07T20:31:45.6288079Z self = 2025-05-07T20:31:45.6288247Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.6288252Z 2025-05-07T20:31:45.6288333Z @given( 2025-05-07T20:31:45.6288456Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6288557Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6288673Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6288794Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6288914Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6288991Z ) 2025-05-07T20:31:45.6289239Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6289335Z def test_silu_mul_quant( 2025-05-07T20:31:45.6289417Z self, 2025-05-07T20:31:45.6289497Z T: int, 2025-05-07T20:31:45.6289576Z D: int, 2025-05-07T20:31:45.6289680Z scale_ub: Optional[float], 2025-05-07T20:31:45.6289852Z contiguous: bool, 2025-05-07T20:31:45.6289943Z compiled: bool, 2025-05-07T20:31:45.6290027Z ) -> None: 2025-05-07T20:31:45.6290125Z torch.manual_seed(2025) 2025-05-07T20:31:45.6290200Z 2025-05-07T20:31:45.6290370Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6292125Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6292136Z 2025-05-07T20:31:45.6292265Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6292400Z =============================== warnings summary =============================== 2025-05-07T20:31:45.6292713Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:45.6293015Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:45.6293312Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:45.6294199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:31:45.6294428Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:31:45.6294433Z 2025-05-07T20:31:45.6294621Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:31:45.6295893Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:31:45.6296081Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:31:45.6296169Z 2025-05-07T20:31:45.6296382Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:31:45.6296550Z ================== 1 failed, 1 passed, 13 warnings in 21.87s =================== 2025-05-07T20:31:47.3461972Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:31:47.4079862Z 2025-05-07T20:31:47.4080303Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:31:47.4080728Z 2025-05-07T20:31:47.4080740Z 2025-05-07T20:31:47.4101037Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:31:49.5359591Z ============================= test session starts ============================== 2025-05-07T20:31:49.5360254Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:49.5360781Z cachedir: .pytest_cache 2025-05-07T20:31:49.5361354Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:49.5362081Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:49.5362799Z plugins: hypothesis-6.131.14 2025-05-07T20:31:51.1270204Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:51.2789655Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:31:51.2790443Z run-last-failure: rerun previous 1 failure 2025-05-07T20:31:51.2790884Z 2025-05-07T20:31:53.4194697Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:53.4195824Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:31:53.4197170Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:53.4198727Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:53.4199710Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.4201014Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:53.4202397Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.4203381Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.4204600Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:53.4206299Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.4216268Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.4217754Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:53.4219035Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:31:53.4220268Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:53.4221485Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:31:53.4222327Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.4223355Z W0507 
20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:53.4224581Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:31:53.4225392Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:31:53.4226612Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:53.4227906Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:53.4229026Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:53.4230079Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:31:53.4231259Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:53.4232620Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:53.4233824Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.4234764Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:53.4235512Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:31:53.4236542Z W0507 20:31:53.417000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:53.4365773Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:53.4366837Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:31:53.4368501Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:53.4369940Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:53.4370920Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.4372230Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:53.4373609Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.4374585Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.4375939Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:53.4377309Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.4378368Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.4379652Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:53.4380882Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:31:53.4382098Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:53.4383297Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:31:53.4384121Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.4385134Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:53.4386149Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:31:53.4386939Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:31:53.4388144Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:53.4389419Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:53.4390604Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:53.4391636Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:31:53.4392818Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:53.4394163Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:53.4395222Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.4396126Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:53.4396861Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:31:53.4397952Z W0507 20:31:53.435000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:53.9530227Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:31:53.9530912Z self=, 2025-05-07T20:31:53.9531349Z T=1, 2025-05-07T20:31:53.9531547Z D=5120, 2025-05-07T20:31:53.9531755Z scale_ub=None, 2025-05-07T20:31:53.9531982Z contiguous=True, 2025-05-07T20:31:53.9532210Z compiled=True, 2025-05-07T20:31:53.9532452Z ) 2025-05-07T20:31:53.9532781Z self = 2025-05-07T20:31:53.9533272Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:53.9533544Z 2025-05-07T20:31:53.9533630Z @given( 2025-05-07T20:31:53.9533875Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:53.9534192Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:53.9534509Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:53.9534862Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:53.9535204Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:53.9535502Z ) 2025-05-07T20:31:53.9535862Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:53.9536319Z def test_silu_mul_quant( 2025-05-07T20:31:53.9536578Z self, 2025-05-07T20:31:53.9536795Z T: int, 2025-05-07T20:31:53.9537013Z D: int, 2025-05-07T20:31:53.9537238Z scale_ub: Optional[float], 2025-05-07T20:31:53.9537526Z contiguous: bool, 2025-05-07T20:31:53.9537784Z compiled: bool, 2025-05-07T20:31:53.9538018Z ) -> None: 2025-05-07T20:31:53.9538248Z torch.manual_seed(2025) 2025-05-07T20:31:53.9538507Z 2025-05-07T20:31:53.9538794Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:53.9539152Z 2025-05-07T20:31:53.9539363Z x_sign = torch.sign(x) 2025-05-07T20:31:53.9539666Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:53.9539989Z x = x_sign * x_clamp 2025-05-07T20:31:53.9540243Z x0 = x[:, :D] 2025-05-07T20:31:53.9540471Z x1 = x[:, D:] 2025-05-07T20:31:53.9540686Z 2025-05-07T20:31:53.9540885Z if contiguous: 2025-05-07T20:31:53.9541132Z x0 = x0.contiguous() 2025-05-07T20:31:53.9541399Z x1 = x1.contiguous() 2025-05-07T20:31:53.9541655Z 2025-05-07T20:31:53.9542198Z if scale_ub is not None: 2025-05-07T20:31:53.9542475Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:53.9542823Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:53.9543144Z ) 2025-05-07T20:31:53.9543347Z else: 2025-05-07T20:31:53.9543570Z scale_ub_tensor = None 2025-05-07T20:31:53.9543830Z 2025-05-07T20:31:53.9544064Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:53.9544389Z op = silu_mul_quant 2025-05-07T20:31:53.9544653Z if compiled: 2025-05-07T20:31:53.9544903Z op = torch.compile(op) 2025-05-07T20:31:53.9545208Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:53.9545492Z 2025-05-07T20:31:53.9545695Z y_fp8, y_scale = fn() 2025-05-07T20:31:53.9545995Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:53.9546296Z 2025-05-07T20:31:53.9546542Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:53.9546881Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:53.9547183Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:53.9547509Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:53.9547876Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:53.9548187Z 2025-05-07T20:31:53.9548402Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:53.9548600Z 2025-05-07T20:31:53.9548710Z moe/activation_test.py:126: 2025-05-07T20:31:53.9549154Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.9549501Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:53.9549835Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:53.9550630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:53.9551391Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:53.9551951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:53.9552639Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:53.9553369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:53.9554305Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:53.9555064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:53.9555818Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:53.9556541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:53.9557185Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:53.9557797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:53.9558322Z fn() 2025-05-07T20:31:53.9558826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:53.9559414Z self.fn.run( 2025-05-07T20:31:53.9559888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:53.9560422Z kernel = self.compile( 2025-05-07T20:31:53.9560968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:53.9561619Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.9562018Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.9562401Z 2025-05-07T20:31:53.9562610Z self = 2025-05-07T20:31:53.9563789Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:53.9565180Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a97697ce0>} 2025-05-07T20:31:53.9566536Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:53.9567623Z context = 2025-05-07T20:31:53.9567920Z 2025-05-07T20:31:53.9568088Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:53.9568618Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.9569094Z module_map=module_map) 2025-05-07T20:31:53.9569457Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.9569819Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:53.9570093Z E ^ 2025-05-07T20:31:53.9570556Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:53.9571097Z 2025-05-07T20:31:53.9571516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:53.9572038Z 2025-05-07T20:31:53.9572147Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:53.9572563Z self=, 2025-05-07T20:31:53.9572965Z T=2048, 2025-05-07T20:31:53.9573162Z D=5120, 2025-05-07T20:31:53.9573367Z scale_ub=1200.0, 2025-05-07T20:31:53.9573585Z contiguous=True, 2025-05-07T20:31:53.9573812Z compiled=False, 2025-05-07T20:31:53.9574024Z ) 2025-05-07T20:31:54.4995232Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:54.4997391Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:31:54.5000164Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:54.5003452Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:54.5004490Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.5006059Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:54.5007448Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.5008542Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.5009778Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:54.5011507Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.5012589Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.5013872Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:54.5015126Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:31:54.5016354Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:54.5017561Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:31:54.5018537Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.5019568Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:54.5020589Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:31:54.5021383Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^ 2025-05-07T20:31:54.5022594Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:54.5023930Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:54.5025064Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:54.5026108Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:31:54.5027290Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:54.5028655Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:54.5029722Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.5030640Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.5031389Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:31:54.5032413Z W0507 20:31:54.496000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.0528150Z self = 2025-05-07T20:31:55.0528685Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:55.0529003Z 2025-05-07T20:31:55.0529086Z @given( 2025-05-07T20:31:55.0529327Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:55.0529647Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:55.0529961Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:55.0530297Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:55.0530627Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:55.0530936Z ) 2025-05-07T20:31:55.0531291Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:55.0531740Z def test_silu_mul_quant( 2025-05-07T20:31:55.0531983Z self, 2025-05-07T20:31:55.0532188Z T: int, 2025-05-07T20:31:55.0532395Z D: int, 2025-05-07T20:31:55.0532654Z scale_ub: Optional[float], 2025-05-07T20:31:55.0532925Z contiguous: bool, 2025-05-07T20:31:55.0533173Z compiled: bool, 2025-05-07T20:31:55.0533414Z ) -> None: 2025-05-07T20:31:55.0533629Z torch.manual_seed(2025) 2025-05-07T20:31:55.0533883Z 2025-05-07T20:31:55.0534168Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:55.0534523Z 2025-05-07T20:31:55.0534719Z x_sign = torch.sign(x) 2025-05-07T20:31:55.0535021Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:55.0535337Z x = x_sign * x_clamp 2025-05-07T20:31:55.0535583Z x0 = x[:, :D] 2025-05-07T20:31:55.0535804Z x1 = x[:, D:] 2025-05-07T20:31:55.0536019Z 2025-05-07T20:31:55.0536207Z if contiguous: 2025-05-07T20:31:55.0536449Z x0 = x0.contiguous() 2025-05-07T20:31:55.0536710Z x1 = x1.contiguous() 2025-05-07T20:31:55.0536947Z 2025-05-07T20:31:55.0537145Z if scale_ub is not None: 2025-05-07T20:31:55.0537421Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:55.0537753Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:55.0538381Z ) 2025-05-07T20:31:55.0538580Z else: 2025-05-07T20:31:55.0538791Z scale_ub_tensor = None 2025-05-07T20:31:55.0539050Z 2025-05-07T20:31:55.0539289Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:55.0539613Z op = silu_mul_quant 2025-05-07T20:31:55.0539863Z if compiled: 2025-05-07T20:31:55.0540114Z op = torch.compile(op) 2025-05-07T20:31:55.0540419Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:55.0540693Z 2025-05-07T20:31:55.0540897Z > y_fp8, y_scale = fn() 2025-05-07T20:31:55.0541063Z 2025-05-07T20:31:55.0541177Z moe/activation_test.py:117: 2025-05-07T20:31:55.0541475Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.0541813Z moe/activation_test.py:115: in fn 2025-05-07T20:31:55.0542098Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:55.0542790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:55.0543501Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:55.0544091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:55.0544779Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:55.0545577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:55.0546121Z kernel = self.compile( 2025-05-07T20:31:55.0546672Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:55.0547335Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.0547732Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.0547978Z 2025-05-07T20:31:55.0548187Z self = 2025-05-07T20:31:55.0549276Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:55.0550686Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a97911da0>} 2025-05-07T20:31:55.0552037Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:55.0553067Z context = 2025-05-07T20:31:55.0553364Z 2025-05-07T20:31:55.0553531Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:55.0554065Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.0554530Z module_map=module_map) 2025-05-07T20:31:55.0554903Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.0555261Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:55.0555521Z E ^ 2025-05-07T20:31:55.0556000Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.0556463Z 2025-05-07T20:31:55.0556888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:55.0557403Z 2025-05-07T20:31:55.0557514Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:55.0557926Z self=, 2025-05-07T20:31:55.0558343Z T=2048, 2025-05-07T20:31:55.0558629Z D=5120, 2025-05-07T20:31:55.0558827Z scale_ub=1200.0, 2025-05-07T20:31:55.0559059Z contiguous=True, 2025-05-07T20:31:55.0559288Z compiled=True, 2025-05-07T20:31:55.0559504Z ) 2025-05-07T20:31:55.0559827Z self = 2025-05-07T20:31:55.0560328Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:55.0560602Z 2025-05-07T20:31:55.0560693Z @given( 2025-05-07T20:31:55.0560929Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:55.0561249Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:55.0561564Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:55.0561894Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:55.0562229Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:55.0562525Z ) 2025-05-07T20:31:55.0562881Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:55.0563330Z def test_silu_mul_quant( 2025-05-07T20:31:55.0563583Z self, 2025-05-07T20:31:55.0563787Z T: int, 2025-05-07T20:31:55.0563989Z D: int, 2025-05-07T20:31:55.0564216Z scale_ub: Optional[float], 2025-05-07T20:31:55.0564494Z contiguous: bool, 2025-05-07T20:31:55.0564734Z compiled: bool, 2025-05-07T20:31:55.0564970Z ) -> None: 2025-05-07T20:31:55.0565197Z torch.manual_seed(2025) 2025-05-07T20:31:55.0565439Z 2025-05-07T20:31:55.0565801Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:55.0566158Z 2025-05-07T20:31:55.0566353Z x_sign = torch.sign(x) 2025-05-07T20:31:55.0566654Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:55.0566974Z x = x_sign * x_clamp 2025-05-07T20:31:55.0567214Z x0 = x[:, :D] 
2025-05-07T20:31:55.0567440Z x1 = x[:, D:] 2025-05-07T20:31:55.0567738Z 2025-05-07T20:31:55.0567926Z if contiguous: 2025-05-07T20:31:55.0568177Z x0 = x0.contiguous() 2025-05-07T20:31:55.0568445Z x1 = x1.contiguous() 2025-05-07T20:31:55.0568699Z 2025-05-07T20:31:55.0568895Z if scale_ub is not None: 2025-05-07T20:31:55.0569171Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:55.0569511Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:55.0569825Z ) 2025-05-07T20:31:55.0570025Z else: 2025-05-07T20:31:55.0570246Z scale_ub_tensor = None 2025-05-07T20:31:55.0570503Z 2025-05-07T20:31:55.0570740Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:55.0571057Z op = silu_mul_quant 2025-05-07T20:31:55.0571306Z if compiled: 2025-05-07T20:31:55.0571558Z op = torch.compile(op) 2025-05-07T20:31:55.0571860Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:55.0572135Z 2025-05-07T20:31:55.0572337Z y_fp8, y_scale = fn() 2025-05-07T20:31:55.0572632Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:55.0572924Z 2025-05-07T20:31:55.0573167Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:55.0573510Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:55.0573852Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:55.0574168Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:55.0574529Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:55.0574848Z 2025-05-07T20:31:55.0575050Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:55.0575250Z 2025-05-07T20:31:55.0575349Z moe/activation_test.py:126: 2025-05-07T20:31:55.0575651Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.0575986Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:55.0576314Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:55.0577104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:55.0577953Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:55.0578498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:55.0579186Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:55.0579881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:55.0580615Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:55.0589277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:55.0590256Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:55.0591186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:55.0591977Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:55.0592711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:55.0593347Z fn() 2025-05-07T20:31:55.0594022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:55.0594845Z self.fn.run( 2025-05-07T20:31:55.0595332Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:55.0595878Z kernel = self.compile( 2025-05-07T20:31:55.0596433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:55.0597091Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.0597508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.0597744Z 2025-05-07T20:31:55.0597965Z self = 2025-05-07T20:31:55.0599058Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:55.0600443Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a9642f2e0>} 2025-05-07T20:31:55.0601800Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:55.0602835Z context = 2025-05-07T20:31:55.0603132Z 2025-05-07T20:31:55.0603311Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:55.0603840Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.0604315Z module_map=module_map) 2025-05-07T20:31:55.0604698Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.0605067Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:55.0605342Z E ^ 2025-05-07T20:31:55.0606153Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.0606608Z 2025-05-07T20:31:55.0607037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:55.0607604Z 2025-05-07T20:31:55.0607714Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:55.0608138Z self=, 2025-05-07T20:31:55.0608716Z T=16384, 2025-05-07T20:31:55.0608925Z D=7168, 2025-05-07T20:31:55.0609128Z scale_ub=1200.0, 2025-05-07T20:31:55.0609370Z contiguous=False, 2025-05-07T20:31:55.0609607Z compiled=False, 2025-05-07T20:31:55.0609819Z ) 2025-05-07T20:31:55.3726105Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:55.3728454Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:31:55.3731159Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:55.3733873Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:55.3734892Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.3736563Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 
2025-05-07T20:31:55.3737957Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.3738950Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.3740184Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:55.3741564Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.3742643Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.3743928Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:55.3745180Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 2025-05-07T20:31:55.3746391Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:55.3747602Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:31:55.3748436Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.3749466Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:55.3750491Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:31:55.3751434Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:55.3752649Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:55.3753948Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:55.3755071Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:55.3756113Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:31:55.3757289Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:55.3758651Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:55.3759799Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.3760717Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:55.3761459Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^ 2025-05-07T20:31:55.3762479Z W0507 20:31:55.370000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.4482286Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:55.4483556Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:31:55.4485272Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:55.4486717Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:55.4487810Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.4489128Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:55.4490519Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.4491514Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.4492753Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:55.4494509Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.4495590Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.4496870Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:55.4498123Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 2025-05-07T20:31:55.4499362Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:55.4500573Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:31:55.4501541Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.4502573Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:55.4503595Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:31:55.4504396Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:55.4505888Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:55.4507188Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:55.4508311Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:55.4509352Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:31:55.4510530Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:55.4511892Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:55.4512955Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.4513892Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:55.4514663Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^ 2025-05-07T20:31:55.4515684Z W0507 20:31:55.445000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:56.1322791Z self = 2025-05-07T20:31:56.1323359Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:56.1323644Z 2025-05-07T20:31:56.1323730Z @given( 2025-05-07T20:31:56.1323973Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:56.1324296Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:56.1324639Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:56.1324984Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:56.1325317Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:56.1325617Z ) 2025-05-07T20:31:56.1325970Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:56.1326421Z def test_silu_mul_quant( 2025-05-07T20:31:56.1326668Z self, 2025-05-07T20:31:56.1326865Z T: int, 2025-05-07T20:31:56.1327085Z D: int, 2025-05-07T20:31:56.1327310Z scale_ub: Optional[float], 2025-05-07T20:31:56.1327717Z contiguous: bool, 2025-05-07T20:31:56.1327966Z compiled: bool, 2025-05-07T20:31:56.1328194Z ) -> None: 2025-05-07T20:31:56.1328419Z torch.manual_seed(2025) 2025-05-07T20:31:56.1328673Z 2025-05-07T20:31:56.1328944Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:56.1329295Z 2025-05-07T20:31:56.1329497Z x_sign = torch.sign(x) 2025-05-07T20:31:56.1330116Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:56.1330445Z x = x_sign * x_clamp 2025-05-07T20:31:56.1330695Z x0 = x[:, :D] 2025-05-07T20:31:56.1330914Z x1 = x[:, D:] 2025-05-07T20:31:56.1331131Z 2025-05-07T20:31:56.1331323Z if contiguous: 2025-05-07T20:31:56.1331555Z x0 = x0.contiguous() 2025-05-07T20:31:56.1331819Z x1 = x1.contiguous() 2025-05-07T20:31:56.1332062Z 2025-05-07T20:31:56.1332261Z if scale_ub is not None: 2025-05-07T20:31:56.1332538Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:56.1332883Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:56.1333198Z ) 2025-05-07T20:31:56.1333394Z else: 2025-05-07T20:31:56.1333614Z scale_ub_tensor = None 2025-05-07T20:31:56.1333872Z 2025-05-07T20:31:56.1334106Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:56.1334432Z op = silu_mul_quant 2025-05-07T20:31:56.1334686Z if compiled: 2025-05-07T20:31:56.1334934Z op = torch.compile(op) 2025-05-07T20:31:56.1335236Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.1335519Z 2025-05-07T20:31:56.1335711Z > y_fp8, y_scale = fn() 2025-05-07T20:31:56.1335882Z 2025-05-07T20:31:56.1335984Z moe/activation_test.py:117: 2025-05-07T20:31:56.1336286Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.1336626Z moe/activation_test.py:115: in fn 2025-05-07T20:31:56.1336912Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.1337609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:56.1338311Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:56.1338846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:56.1339541Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:56.1340211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:56.1340751Z kernel = self.compile( 2025-05-07T20:31:56.1341297Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:56.1341960Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.1342543Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.1342775Z 2025-05-07T20:31:56.1342985Z self = 2025-05-07T20:31:56.1344080Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:56.1345479Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a96238860>} 2025-05-07T20:31:56.1346827Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:56.1347859Z context = 2025-05-07T20:31:56.1348148Z 2025-05-07T20:31:56.1348315Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:56.1348846Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.1349319Z module_map=module_map) 2025-05-07T20:31:56.1349689Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.1350122Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:56.1350399Z E ^ 2025-05-07T20:31:56.1350869Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:56.1351320Z 2025-05-07T20:31:56.1351740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:56.1352261Z 2025-05-07T20:31:56.1352369Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:56.1352795Z self=, 2025-05-07T20:31:56.1353206Z T=1, 2025-05-07T20:31:56.1353397Z D=7168, 2025-05-07T20:31:56.1353605Z scale_ub=None, 2025-05-07T20:31:56.1353827Z contiguous=True, 2025-05-07T20:31:56.1354050Z compiled=True, 2025-05-07T20:31:56.1354266Z ) 2025-05-07T20:31:56.1354594Z self = 2025-05-07T20:31:56.1355099Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:56.1355362Z 2025-05-07T20:31:56.1355443Z @given( 2025-05-07T20:31:56.1355686Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:56.1356008Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:56.1356312Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:56.1356647Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:56.1356982Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:56.1357283Z ) 2025-05-07T20:31:56.1357631Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:56.1358080Z def test_silu_mul_quant( 2025-05-07T20:31:56.1358331Z self, 2025-05-07T20:31:56.1358528Z T: int, 2025-05-07T20:31:56.1358736Z D: int, 2025-05-07T20:31:56.1358961Z scale_ub: Optional[float], 2025-05-07T20:31:56.1359233Z contiguous: bool, 2025-05-07T20:31:56.1359481Z compiled: bool, 2025-05-07T20:31:56.1359719Z ) -> None: 2025-05-07T20:31:56.1359936Z torch.manual_seed(2025) 2025-05-07T20:31:56.1360185Z 2025-05-07T20:31:56.1360464Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:56.1360808Z 2025-05-07T20:31:56.1361009Z x_sign = torch.sign(x) 2025-05-07T20:31:56.1361304Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:56.1361611Z x = x_sign * x_clamp 2025-05-07T20:31:56.1361949Z x0 = x[:, :D] 2025-05-07T20:31:56.1362169Z x1 = 
x[:, D:] 2025-05-07T20:31:56.1362378Z 2025-05-07T20:31:56.1362572Z if contiguous: 2025-05-07T20:31:56.1362813Z x0 = x0.contiguous() 2025-05-07T20:31:56.1363080Z x1 = x1.contiguous() 2025-05-07T20:31:56.1363320Z 2025-05-07T20:31:56.1363518Z if scale_ub is not None: 2025-05-07T20:31:56.1363795Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:56.1364138Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:56.1364456Z ) 2025-05-07T20:31:56.1364661Z else: 2025-05-07T20:31:56.1364873Z scale_ub_tensor = None 2025-05-07T20:31:56.1365132Z 2025-05-07T20:31:56.1365369Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:56.1365682Z op = silu_mul_quant 2025-05-07T20:31:56.1365938Z if compiled: 2025-05-07T20:31:56.1366192Z op = torch.compile(op) 2025-05-07T20:31:56.1366493Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:56.1366780Z 2025-05-07T20:31:56.1366983Z y_fp8, y_scale = fn() 2025-05-07T20:31:56.1367271Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:56.1367639Z 2025-05-07T20:31:56.1367894Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:56.1368239Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:56.1368533Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:56.1368934Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:56.1369302Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:56.1369617Z 2025-05-07T20:31:56.1369827Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:56.1370025Z 2025-05-07T20:31:56.1370134Z moe/activation_test.py:126: 2025-05-07T20:31:56.1370430Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.1370771Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:56.1371108Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:56.1371897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:56.1372654Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:56.1373210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:56.1373937Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:56.1374655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:56.1375376Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:56.1376139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:56.1376896Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:56.1377623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:56.1378268Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:56.1378876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:56.1379402Z fn() 2025-05-07T20:31:56.1379909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:56.1380497Z self.fn.run( 2025-05-07T20:31:56.1380972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 
2025-05-07T20:31:56.1381503Z kernel = self.compile( 2025-05-07T20:31:56.1382048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:56.1382792Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.1383195Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:56.1383425Z 2025-05-07T20:31:56.1383633Z self = 2025-05-07T20:31:56.1384719Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:56.1386094Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a96239080>} 2025-05-07T20:31:56.1387439Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:56.1388472Z context = 2025-05-07T20:31:56.1388763Z 2025-05-07T20:31:56.1388928Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:56.1389454Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.1389925Z module_map=module_map) 2025-05-07T20:31:56.1390366Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.1390734Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:56.1391012Z E ^ 2025-05-07T20:31:56.1391481Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:56.1391932Z 2025-05-07T20:31:56.1392349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:56.1392874Z 2025-05-07T20:31:56.1392982Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:56.1393400Z self=, 2025-05-07T20:31:56.1393807Z T=4096, 2025-05-07T20:31:56.1394000Z D=5120, 2025-05-07T20:31:56.1394197Z scale_ub=None, 2025-05-07T20:31:56.1394422Z contiguous=False, 2025-05-07T20:31:56.1394648Z compiled=False, 2025-05-07T20:31:56.1394860Z ) 2025-05-07T20:31:56.5055054Z W0507 20:31:56.503000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:56.5056129Z W0507 20:31:56.503000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): 2025-05-07T20:31:56.5057468Z W0507 20:31:56.503000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:56.5058894Z W0507 20:31:56.503000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:56.5059867Z W0507 20:31:56.503000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:56.5061180Z W0507 20:31:56.503000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:56.5062562Z W0507 20:31:56.503000 87443 
2025-05-07T20:31:56.5055054Z W0507 20:31:56.503000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated (CompilationError at 1:0 in _fbgemm_silu_mul_quant: type fp8e4nv not supported in this architecture; traceback identical to the one above)
2025-05-07T20:31:56.7730244Z W0507 20:31:56.770000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] (same warning repeated with an identical traceback)
2025-05-07T20:31:57.4596407Z self = <...>
T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False
    (test source identical to the listing above)
>   y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:31:57.4633516Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
    -> same failure in fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant (CompilationError: fp8e4nv not supported)
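For orientation while reading these failures: ref_fn in the listing above computes SiLU(x0) * x1 in fp32 and then quantizes rowwise to FP8. Below is a minimal pure-PyTorch sketch of that computation, assuming triton_quantize_fp8_row returns a per-row dequantization scale (the test reconstructs y as y_fp8.to(torch.float32) * y_scale[:, None]) and that scale_ub caps the per-row max before the scale is formed; the 448.0 constant (finite max of float8_e4m3fn) and all names here are assumptions, not FBGEMM's actual implementation:

    import torch

    FP8_MAX_E4M3 = 448.0  # finite max of torch.float8_e4m3fn (assumed target format)

    def rowwise_fp8_quant_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        # Per-row absolute max, optionally capped by scale_ub (assumed semantics).
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / FP8_MAX_E4M3    # per-row dequant scale
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        x0, x1 = x0.to(torch.float32), x1.to(torch.float32)
        y = x0 * torch.sigmoid(x0) * x1                    # SiLU(x0) * x1, as in ref_fn
        return rowwise_fp8_quant_ref(y, scale_ub)

Dequantizing with y_fp8.to(torch.float32) * scale[:, None] should approximately recover y, which is the identity the test relies on.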
2025-05-07T20:31:57.4665233Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
    -> fails in ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row (CompilationError: fp8e4nv not supported)
2025-05-07T20:31:57.5143275Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
    -> fails in fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant (CompilationError: fp8e4nv not supported)
2025-05-07T20:31:57.8214880Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    -> fails in fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant (CompilationError: fp8e4nv not supported)
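The ValueError itself names the only fp8 types Triton will accept on this GPU: 'fp8e4b15' and 'fp8e5'. A hedged sketch of a capability-based dtype choice follows; this is an illustrative policy only, not FBGEMM's actual dispatch, assuming torch.float8_e5m2 as the PyTorch counterpart of Triton's fp8e5:

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # Prefer e4m3 where Triton supports fp8e4nv (SM >= 8.9); otherwise
        # fall back to e5m2, one of the types the error lists as supported.
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2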
2025-05-07T20:31:57.8252487Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:31:58.1603297Z W0507 20:31:58.157000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated (_fbgemm_silu_mul_quant: fp8e4nv not supported; traceback identical to the [0/3] warnings above)
2025-05-07T20:31:58.2457883Z W0507 20:31:58.243000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] (same warning repeated with an identical traceback)
2025-05-07T20:31:58.5460392Z self = <...>
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
    (test source identical to the first listing)
    -> fails in ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row (CompilationError: fp8e4nv not supported)
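About the W0507 lines interleaved above: during torch.compile tracing, torch._higher_order_ops.triton_kernel_wrap tries to lower the user's Triton kernel to TTIR in order to determine which tensor arguments the kernel mutates; when that lowering fails (here, for the same fp8e4nv reason), it falls back to assuming every input is mutated, exactly as the warning text says. The warnings are a side effect of the underlying compile failure, not an independent bug. A toy sketch of that conservative fallback, illustrative only and not PyTorch's actual code:

    def mutated_args(kernel_args, ttir_analysis=None):
        # If TTIR-based mutation analysis is unavailable (it raised), be
        # conservative and report every argument as potentially mutated.
        if ttir_analysis is None:
            return set(kernel_args)
        return {a for a in kernel_args if ttir_analysis.get(a, False)}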
2025-05-07T20:31:58.5497957Z 
2025-05-07T20:31:58.5498375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:58.5498904Z 
2025-05-07T20:31:58.5499010Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:58.5499430Z     self=,
2025-05-07T20:31:58.5499832Z     T=2048,
2025-05-07T20:31:58.5500027Z     D=5120,
2025-05-07T20:31:58.5500227Z     scale_ub=None,
2025-05-07T20:31:58.5500443Z     contiguous=True,
2025-05-07T20:31:58.5500670Z     compiled=True,
2025-05-07T20:31:58.5500878Z )
2025-05-07T20:31:58.8705455Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:58.8708575Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last):
2025-05-07T20:31:58.8711266Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:58.8714451Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:58.8715626Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:58.8716931Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:58.8718308Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:58.8719289Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:58.8720508Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:58.8721991Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:58.8723052Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:58.8724329Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:58.8725562Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]     generator.visit(fn.parse())
2025-05-07T20:31:58.8726783Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:58.8728099Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]     ret = super().visit(node)
2025-05-07T20:31:58.8728925Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]           ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:58.8729949Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit
2025-05-07T20:31:58.8730961Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]     return visitor(node)
2025-05-07T20:31:58.8731751Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]            ^^^^^^^^^^^^^
2025-05-07T20:31:58.8732970Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:58.8734248Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:58.8735357Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit
2025-05-07T20:31:58.8736481Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]     self.visit(item)
2025-05-07T20:31:58.8737655Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:58.8739005Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:58.8740071Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:58.8740971Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:58.8741710Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^
2025-05-07T20:31:58.8742732Z W0507 20:31:58.868000 87443 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[... the same identify_mutated_tensors warning is emitted a second time, followed by the test source and _kernel_quantize_fp8_row traceback shown above, now for T=2048 ...]
2025-05-07T20:31:59.2620802Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:59.2621166Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:59.2621434Z E       ^
2025-05-07T20:31:59.2621917Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:59.2622791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
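The contract of the failing reference path, triton_quantize_fp8_row, is pinned down by the test's dequant step (y = y_fp8.to(torch.float32) * y_scale[:, None]): it returns row-wise FP8 values plus a per-row dequantization scale. Below is a rough pure-PyTorch sketch of that contract; the clamping epsilon and the exact handling of scale_ub are assumptions, not the kernel's verified semantics. Note that the FP8 cast itself works on SM 8.6 in PyTorch; only Triton's fp8e4nv compute path is unsupported.

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Choose a per-row scale so each row's max |value| maps onto the
        # FP8 E4M3 maximum (448.0); y ~= y_fp8.float() * y_scale[:, None].
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
        y_scale = row_max.clamp(min=1e-12) / fp8_max    # dequantization scale
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale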
2025-05-07T20:31:59.2623414Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:59.2623824Z     self=,
2025-05-07T20:31:59.2624239Z     T=128,
2025-05-07T20:31:59.2624436Z     D=5120,
2025-05-07T20:31:59.2624639Z     scale_ub=None,
2025-05-07T20:31:59.2624864Z     contiguous=True,
2025-05-07T20:31:59.2625096Z     compiled=True,
2025-05-07T20:31:59.2625306Z )
[... same identify_mutated_tensors warnings, test source, and traceback as above, repeated for T=128 ...]
2025-05-07T20:32:00.2140400Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:00.2140762Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:00.2141037Z E       ^
2025-05-07T20:32:00.2141507Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:00.2142393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
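Because Hypothesis draws these examples in sequence, reproducing one failure locally does not require replaying the whole search. One option, sketched below, is to pin a known-failing combination with an explicit @example so it always runs first; max_examples is shown with a literal value since _MAX_SAMPLES is defined elsewhere in the test module, and self is omitted in this standalone sketch.

    from typing import Optional

    from hypothesis import Verbosity, example, given, settings, strategies as st

    @example(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
    def test_silu_mul_quant(
        T: int, D: int, scale_ub: Optional[float], contiguous: bool, compiled: bool
    ) -> None:
        ...  # body as in the log above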
2025-05-07T20:32:00.2143020Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:00.2143440Z     self=,
2025-05-07T20:32:00.2143855Z     T=4096,
2025-05-07T20:32:00.2144048Z     D=5120,
2025-05-07T20:32:00.2144250Z     scale_ub=None,
2025-05-07T20:32:00.2144475Z     contiguous=True,
2025-05-07T20:32:00.2144695Z     compiled=True,
2025-05-07T20:32:00.2144909Z )
[... same identify_mutated_tensors warnings, test source, and traceback as above, repeated for T=4096 ...]
2025-05-07T20:32:01.0066364Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:01.0066729Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:01.0067001Z E       ^
2025-05-07T20:32:01.0067467Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:01.0068338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
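The repeated identify_mutated_tensors warnings are torch.compile reacting to the same compile failure: when TTIR generation for a user-defined Triton kernel raises, mutation analysis is abandoned and every input is conservatively treated as mutated, which is always sound but blocks optimizations. Reduced to a self-contained sketch of the pattern (the function names are illustrative, not torch internals):

    from typing import Callable, Iterable, List

    def mutated_or_all(
        analyze: Callable[[], List[str]], inputs: Iterable[str]
    ) -> List[str]:
        # Precise mutation analysis when it succeeds; otherwise assume every
        # input is mutated, trading optimization for guaranteed correctness.
        try:
            return analyze()
        except Exception:
            return list(inputs)

    def failing_analysis() -> List[str]:
        # Stand-in for generate_ttir() raising CompilationError, as above.
        raise RuntimeError("TTIR generation failed")

    print(mutated_or_all(failing_analysis, ["x0", "x1", "scale_ub"]))
    # -> ['x0', 'x1', 'scale_ub']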
2025-05-07T20:32:01.0068966Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:01.0069379Z     self=,
2025-05-07T20:32:01.0069782Z     T=16384,
2025-05-07T20:32:01.0069982Z     D=5120,
2025-05-07T20:32:01.0070190Z     scale_ub=None,
2025-05-07T20:32:01.0070399Z     contiguous=True,
2025-05-07T20:32:01.0070640Z     compiled=True,
2025-05-07T20:32:01.0070852Z )
2025-05-07T20:32:01.0322469Z W0507 20:32:01.031000 87443 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:01.0323733Z W0507 20:32:01.031000 87443 site-packages/torch/_dynamo/convert_frame.py:987] [0/8]    function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:01.0325077Z W0507 20:32:01.031000 87443 site-packages/torch/_dynamo/convert_frame.py:987] [0/8]    last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:01.0326074Z W0507 20:32:01.031000 87443 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:01.0327181Z W0507 20:32:01.031000 87443 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
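This recompile-limit warning is separate from the FP8 failure. x0 = x[:, :D] is a view whose row stride stays 2 * D = 10240, while x0.contiguous() copies into row stride D = 5120; torch.compile guards on strides, so alternating layouts and T values exhaust the default limit of 8 recompiles, after which dynamo falls back to eager for later examples. A small sketch of the stride difference and one possible mitigation:

    import torch
    import torch._dynamo

    D = 5120
    x = torch.randn(4, 2 * D)
    x0_view = x[:, :D]              # shares storage; stride (2*D, 1) == (10240, 1)
    x0_copy = x0_view.contiguous()  # fresh storage;  stride (D, 1)  == (5120, 1)
    assert x0_view.stride() == (10240, 1)
    assert x0_copy.stride() == (5120, 1)

    # One mitigation sketch: mark the batch dimension dynamic so new values
    # of T do not add guards; raising torch._dynamo.config.recompile_limit
    # is the blunter alternative.
    torch._dynamo.mark_dynamic(x0_copy, 0)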
2025-05-07T20:32:01.1009570Z self = <...>
2025-05-07T20:32:01.1010532Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:01.1010821Z
2025-05-07T20:32:01.1010903Z     @given(
2025-05-07T20:32:01.1011140Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:01.1011456Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:01.1011764Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:01.1012101Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:01.1012449Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:01.1012739Z     )
2025-05-07T20:32:01.1013093Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:01.1013539Z     def test_silu_mul_quant(
2025-05-07T20:32:01.1013786Z         self,
2025-05-07T20:32:01.1013988Z         T: int,
2025-05-07T20:32:01.1014191Z         D: int,
2025-05-07T20:32:01.1014407Z         scale_ub: Optional[float],
2025-05-07T20:32:01.1014696Z         contiguous: bool,
2025-05-07T20:32:01.1014943Z         compiled: bool,
2025-05-07T20:32:01.1015173Z     ) -> None:
2025-05-07T20:32:01.1015396Z         torch.manual_seed(2025)
2025-05-07T20:32:01.1015645Z
2025-05-07T20:32:01.1015919Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:01.1016271Z
2025-05-07T20:32:01.1016482Z         x_sign = torch.sign(x)
2025-05-07T20:32:01.1016779Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:01.1017232Z         x = x_sign * x_clamp
2025-05-07T20:32:01.1017483Z         x0 = x[:, :D]
2025-05-07T20:32:01.1017706Z         x1 = x[:, D:]
2025-05-07T20:32:01.1017916Z
2025-05-07T20:32:01.1018115Z         if contiguous:
2025-05-07T20:32:01.1018354Z             x0 = x0.contiguous()
2025-05-07T20:32:01.1018607Z             x1 = x1.contiguous()
2025-05-07T20:32:01.1018851Z
2025-05-07T20:32:01.1019054Z         if scale_ub is not None:
2025-05-07T20:32:01.1019334Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:01.1019675Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:01.1019987Z             )
2025-05-07T20:32:01.1020178Z         else:
2025-05-07T20:32:01.1020394Z             scale_ub_tensor = None
2025-05-07T20:32:01.1020659Z
2025-05-07T20:32:01.1020893Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:01.1021225Z             op = silu_mul_quant
2025-05-07T20:32:01.1021479Z             if compiled:
2025-05-07T20:32:01.1029755Z                 op = torch.compile(op)
2025-05-07T20:32:01.1030119Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:01.1030417Z
2025-05-07T20:32:01.1030630Z         y_fp8, y_scale = fn()
2025-05-07T20:32:01.1030925Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:01.1031232Z
2025-05-07T20:32:01.1031486Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:01.1031839Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:01.1032146Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:01.1032472Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:01.1032845Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:01.1033161Z
2025-05-07T20:32:01.1033378Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:01.1033585Z
2025-05-07T20:32:01.1033698Z moe/activation_test.py:126:
2025-05-07T20:32:01.1034011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:01.1034371Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:01.1034715Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:01.1035532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:01.1036304Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:01.1054546Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:01.1054911Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:01.1055190Z E   ^
2025-05-07T20:32:01.1055656Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:01.1056540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
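For reference, the test dequantizes with y = y_fp8.to(torch.float32) * y_scale[:, None], so triton_quantize_fp8_row returns a per-row multiplicative scale. Below is a minimal eager sketch of row-wise FP8 quantization consistent with that contract; the function name, the epsilon clamp, and treating scale_ub as a cap on the per-row max are assumptions here, not FBGEMM's actual kernel. Note that an eager cast to torch.float8_e4m3fn works even on the A10G; the architecture limit only bites when Triton compiles fp8 kernels.

from typing import Optional, Tuple
import torch

def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    fp8_max = torch.finfo(torch.float8_e4m3fn).max     # 448.0 for E4M3
    row_max = y.abs().amax(dim=1).to(torch.float32)    # per-row max magnitude
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)     # assumed cap semantics
    y_scale = (row_max / fp8_max).clamp(min=1e-12)     # multiply to dequantize
    y_fp8 = (y.to(torch.float32) / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale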
2025-05-07T20:32:01.1057167Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:01.1057585Z     self=<...>,
2025-05-07T20:32:01.1057987Z     T=1,
2025-05-07T20:32:01.1058184Z     D=5120,
2025-05-07T20:32:01.1058389Z     scale_ub=1200.0,
2025-05-07T20:32:01.1058625Z     contiguous=True,
2025-05-07T20:32:01.1058979Z     compiled=True,
2025-05-07T20:32:01.1059204Z )
2025-05-07T20:32:01.2144707Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:01.2144980Z moe/activation_test.py:117:
2025-05-07T20:32:01.2145653Z moe/activation_test.py:115: in fn
2025-05-07T20:32:01.2145957Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:01.2146510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:01.2147076Z     return fn(*args, **kwargs)
2025-05-07T20:32:01.2147739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:01.2148433Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:01.2159848Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:01.2160215Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:01.2160487Z E   ^
2025-05-07T20:32:01.2160960Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:01.2161833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
torch.tensor( 2025-05-07T20:32:01.2141874Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.2142194Z ) 2025-05-07T20:32:01.2142399Z else: 2025-05-07T20:32:01.2142610Z scale_ub_tensor = None 2025-05-07T20:32:01.2142868Z 2025-05-07T20:32:01.2143112Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.2143429Z op = silu_mul_quant 2025-05-07T20:32:01.2143682Z if compiled: 2025-05-07T20:32:01.2143933Z op = torch.compile(op) 2025-05-07T20:32:01.2144233Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2144506Z 2025-05-07T20:32:01.2144707Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.2144872Z 2025-05-07T20:32:01.2144980Z moe/activation_test.py:117: 2025-05-07T20:32:01.2145274Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.2145653Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.2145957Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2146510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.2147076Z return fn(*args, **kwargs) 2025-05-07T20:32:01.2147739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.2148433Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.2148964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.2149648Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.2150310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.2150992Z kernel = self.compile( 2025-05-07T20:32:01.2151537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.2152192Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.2152593Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.2152821Z 2025-05-07T20:32:01.2153028Z self = 2025-05-07T20:32:01.2154110Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.2155503Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a8455df80>} 2025-05-07T20:32:01.2156905Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.2157935Z context = 2025-05-07T20:32:01.2158223Z 2025-05-07T20:32:01.2158390Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.2158999Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.2159482Z module_map=module_map) 2025-05-07T20:32:01.2159848Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.2160215Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.2160487Z E ^ 2025-05-07T20:32:01.2160960Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.2161411Z 2025-05-07T20:32:01.2161833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.2162351Z 2025-05-07T20:32:01.2162460Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2162881Z self=, 2025-05-07T20:32:01.2163295Z T=1, 2025-05-07T20:32:01.2163486Z D=5120, 2025-05-07T20:32:01.2163692Z scale_ub=None, 2025-05-07T20:32:01.2163920Z contiguous=False, 2025-05-07T20:32:01.2164154Z compiled=True, 2025-05-07T20:32:01.2164378Z ) 2025-05-07T20:32:01.4326785Z self = 2025-05-07T20:32:01.4327629Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:01.4327984Z 2025-05-07T20:32:01.4328085Z @given( 2025-05-07T20:32:01.4328325Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.4328650Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.4328989Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.4329321Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.4329656Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.4329955Z ) 2025-05-07T20:32:01.4330308Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.4330757Z def test_silu_mul_quant( 2025-05-07T20:32:01.4331006Z self, 2025-05-07T20:32:01.4331214Z T: int, 2025-05-07T20:32:01.4331433Z D: int, 2025-05-07T20:32:01.4331661Z scale_ub: Optional[float], 2025-05-07T20:32:01.4331940Z contiguous: bool, 2025-05-07T20:32:01.4332178Z compiled: bool, 2025-05-07T20:32:01.4332416Z ) -> None: 2025-05-07T20:32:01.4332640Z torch.manual_seed(2025) 2025-05-07T20:32:01.4332882Z 2025-05-07T20:32:01.4333163Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.4333852Z 2025-05-07T20:32:01.4334049Z x_sign = torch.sign(x) 2025-05-07T20:32:01.4334344Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.4334658Z x = x_sign * x_clamp 2025-05-07T20:32:01.4334899Z x0 = x[:, :D] 2025-05-07T20:32:01.4335124Z x1 = x[:, D:] 2025-05-07T20:32:01.4335341Z 2025-05-07T20:32:01.4335528Z if contiguous: 2025-05-07T20:32:01.4335769Z x0 = x0.contiguous() 2025-05-07T20:32:01.4336033Z x1 = x1.contiguous() 2025-05-07T20:32:01.4336282Z 2025-05-07T20:32:01.4336512Z if scale_ub is not None: 2025-05-07T20:32:01.4336813Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.4337149Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.4337465Z ) 2025-05-07T20:32:01.4337667Z else: 2025-05-07T20:32:01.4337883Z scale_ub_tensor = None 2025-05-07T20:32:01.4338155Z 2025-05-07T20:32:01.4338391Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.4338715Z op = silu_mul_quant 2025-05-07T20:32:01.4338975Z if compiled: 2025-05-07T20:32:01.4339229Z op = torch.compile(op) 2025-05-07T20:32:01.4339524Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.4339814Z 2025-05-07T20:32:01.4340014Z y_fp8, y_scale = fn() 2025-05-07T20:32:01.4340297Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:01.4340600Z 2025-05-07T20:32:01.4340986Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.4341324Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:01.4341626Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:01.4341947Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:01.4342313Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.4342627Z 2025-05-07T20:32:01.4342838Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:01.4343039Z 2025-05-07T20:32:01.4343146Z moe/activation_test.py:126: 2025-05-07T20:32:01.4343441Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.4343786Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:01.4344119Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.4344915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:01.4345690Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:01.4346236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.4346919Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.4347608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:01.4348330Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.4349083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:01.4349833Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.4350561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:01.4351199Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:01.4351800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:01.4352322Z fn() 2025-05-07T20:32:01.4352825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:01.4353408Z self.fn.run( 2025-05-07T20:32:01.4353878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.4354499Z kernel = self.compile( 2025-05-07T20:32:01.4355035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.4355695Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.4356097Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.4356330Z 2025-05-07T20:32:01.4356543Z self = 2025-05-07T20:32:01.4357633Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.4359022Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f3a8455d4e0>} 2025-05-07T20:32:01.4360369Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.4361395Z context = 2025-05-07T20:32:01.4361682Z 2025-05-07T20:32:01.4361926Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.4362452Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.4362922Z module_map=module_map) 2025-05-07T20:32:01.4363292Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.4363647Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:01.4363920Z E ^ 2025-05-07T20:32:01.4364389Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.4364842Z 2025-05-07T20:32:01.4365260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.4365780Z 2025-05-07T20:32:01.4365888Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.4366344Z self=, 2025-05-07T20:32:01.4366761Z T=1, 2025-05-07T20:32:01.4366950Z D=5120, 2025-05-07T20:32:01.4367162Z scale_ub=None, 2025-05-07T20:32:01.4367388Z contiguous=True, 2025-05-07T20:32:01.4367669Z compiled=False, 2025-05-07T20:32:01.4367886Z ) 2025-05-07T20:32:01.5549628Z self = 2025-05-07T20:32:01.5551051Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.5551577Z 2025-05-07T20:32:01.5551751Z @given( 2025-05-07T20:32:01.5552238Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.5552859Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.5553474Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.5554123Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.5554782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.5555358Z ) 2025-05-07T20:32:01.5555989Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.5556491Z def test_silu_mul_quant( 2025-05-07T20:32:01.5556738Z self, 2025-05-07T20:32:01.5556939Z T: int, 2025-05-07T20:32:01.5557135Z D: int, 2025-05-07T20:32:01.5557361Z scale_ub: Optional[float], 2025-05-07T20:32:01.5557640Z contiguous: bool, 2025-05-07T20:32:01.5557877Z compiled: bool, 2025-05-07T20:32:01.5558116Z ) -> None: 2025-05-07T20:32:01.5558337Z torch.manual_seed(2025) 2025-05-07T20:32:01.5558576Z 2025-05-07T20:32:01.5559245Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.5559595Z 2025-05-07T20:32:01.5559788Z x_sign = torch.sign(x) 2025-05-07T20:32:01.5560082Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.5560397Z x = x_sign * x_clamp 2025-05-07T20:32:01.5560643Z x0 = x[:, :D] 2025-05-07T20:32:01.5560860Z x1 = x[:, D:] 2025-05-07T20:32:01.5561078Z 2025-05-07T20:32:01.5561273Z if contiguous: 2025-05-07T20:32:01.5561510Z x0 = x0.contiguous() 2025-05-07T20:32:01.5561771Z x1 = x1.contiguous() 2025-05-07T20:32:01.5562018Z 2025-05-07T20:32:01.5562212Z if scale_ub is not None: 2025-05-07T20:32:01.5562488Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.5562828Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.5563139Z ) 2025-05-07T20:32:01.5563343Z else: 2025-05-07T20:32:01.5563564Z scale_ub_tensor = None 2025-05-07T20:32:01.5563823Z 2025-05-07T20:32:01.5564055Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.5564378Z op = silu_mul_quant 2025-05-07T20:32:01.5564625Z if compiled: 2025-05-07T20:32:01.5564877Z 
op = torch.compile(op) 2025-05-07T20:32:01.5565178Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.5565454Z 2025-05-07T20:32:01.5565656Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.5565829Z 2025-05-07T20:32:01.5566077Z moe/activation_test.py:117: 2025-05-07T20:32:01.5566382Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.5566717Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.5567006Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.5567821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.5568515Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.5569049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.5569730Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.5570389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.5570920Z kernel = self.compile( 2025-05-07T20:32:01.5571468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.5572122Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.5572518Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.5572753Z 2025-05-07T20:32:01.5572959Z self = 2025-05-07T20:32:01.5574038Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.5575436Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a8455cea0>} 2025-05-07T20:32:01.5576835Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.5577852Z context = 2025-05-07T20:32:01.5578145Z 2025-05-07T20:32:01.5578311Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.5578836Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.5579395Z module_map=module_map) 2025-05-07T20:32:01.5579757Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.5580116Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.5580385Z E ^ 2025-05-07T20:32:01.5580846Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.5581301Z 2025-05-07T20:32:01.5581721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.5582239Z 2025-05-07T20:32:01.5582346Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.5582760Z self=, 2025-05-07T20:32:01.5583163Z T=128, 2025-05-07T20:32:01.5583364Z D=5120, 2025-05-07T20:32:01.5583566Z scale_ub=None, 2025-05-07T20:32:01.5583787Z contiguous=False, 2025-05-07T20:32:01.5584019Z compiled=True, 2025-05-07T20:32:01.5584241Z ) 2025-05-07T20:32:01.5584559Z self = 2025-05-07T20:32:01.5585053Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:01.5585328Z 2025-05-07T20:32:01.5585414Z @given( 2025-05-07T20:32:01.5585661Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.5585976Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.5586370Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.5586710Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.5587037Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.5587331Z ) 2025-05-07T20:32:01.5587684Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.5588119Z def test_silu_mul_quant( 2025-05-07T20:32:01.5588366Z self, 2025-05-07T20:32:01.5588569Z T: int, 2025-05-07T20:32:01.5588772Z D: int, 2025-05-07T20:32:01.5588998Z scale_ub: Optional[float], 2025-05-07T20:32:01.5589276Z contiguous: bool, 2025-05-07T20:32:01.5589525Z compiled: bool, 2025-05-07T20:32:01.5589751Z ) -> None: 2025-05-07T20:32:01.5589971Z torch.manual_seed(2025) 2025-05-07T20:32:01.5590219Z 2025-05-07T20:32:01.5590489Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.5590834Z 2025-05-07T20:32:01.5591033Z x_sign = torch.sign(x) 2025-05-07T20:32:01.5591325Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.5591641Z x = x_sign * x_clamp 2025-05-07T20:32:01.5591884Z x0 = x[:, :D] 2025-05-07T20:32:01.5592100Z x1 = x[:, D:] 2025-05-07T20:32:01.5592312Z 2025-05-07T20:32:01.5592504Z if contiguous: 2025-05-07T20:32:01.5592733Z x0 = x0.contiguous() 2025-05-07T20:32:01.5592995Z x1 = x1.contiguous() 2025-05-07T20:32:01.5596615Z 2025-05-07T20:32:01.5596812Z if scale_ub is not None: 2025-05-07T20:32:01.5597096Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.5597439Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.5597750Z ) 2025-05-07T20:32:01.5597954Z else: 2025-05-07T20:32:01.5598177Z scale_ub_tensor = None 2025-05-07T20:32:01.5598429Z 2025-05-07T20:32:01.5598663Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.5598984Z op = silu_mul_quant 2025-05-07T20:32:01.5599225Z if compiled: 2025-05-07T20:32:01.5599471Z op = torch.compile(op) 2025-05-07T20:32:01.5599765Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.5600045Z 2025-05-07T20:32:01.5600251Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.5600425Z 2025-05-07T20:32:01.5600525Z moe/activation_test.py:117: 2025-05-07T20:32:01.5600829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.5609677Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.5610005Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.5610578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.5611150Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.5611822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.5612520Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.5613064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.5613751Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.5614417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.5614950Z kernel = self.compile( 2025-05-07T20:32:01.5615506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.5616217Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.5616616Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.5616858Z 2025-05-07T20:32:01.5617068Z self = 2025-05-07T20:32:01.5618317Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.5619699Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9c09c60>} 2025-05-07T20:32:01.5621045Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.5622067Z context = 2025-05-07T20:32:01.5622363Z 2025-05-07T20:32:01.5622531Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.5623055Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.5623534Z module_map=module_map) 2025-05-07T20:32:01.5623900Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.5624260Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.5624527Z E ^ 2025-05-07T20:32:01.5624992Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.5625448Z 2025-05-07T20:32:01.5625981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.5626504Z 2025-05-07T20:32:01.5626611Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.5627028Z self=, 2025-05-07T20:32:01.5627430Z T=128, 2025-05-07T20:32:01.5627628Z D=7168, 2025-05-07T20:32:01.5627829Z scale_ub=1200.0, 2025-05-07T20:32:01.5628053Z contiguous=False, 2025-05-07T20:32:01.5628289Z compiled=False, 2025-05-07T20:32:01.5628502Z ) 2025-05-07T20:32:01.6493639Z self = 2025-05-07T20:32:01.6494433Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.6494764Z 2025-05-07T20:32:01.6494850Z @given( 2025-05-07T20:32:01.6495082Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.6495399Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.6495995Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.6496331Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.6496669Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.6496953Z ) 2025-05-07T20:32:01.6497310Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.6497758Z def test_silu_mul_quant( 2025-05-07T20:32:01.6498004Z self, 2025-05-07T20:32:01.6498209Z T: int, 2025-05-07T20:32:01.6498431Z D: int, 2025-05-07T20:32:01.6498652Z scale_ub: Optional[float], 2025-05-07T20:32:01.6498932Z contiguous: bool, 2025-05-07T20:32:01.6499179Z compiled: bool, 2025-05-07T20:32:01.6499408Z ) -> None: 2025-05-07T20:32:01.6499633Z torch.manual_seed(2025) 2025-05-07T20:32:01.6499901Z 2025-05-07T20:32:01.6500193Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.6500544Z 2025-05-07T20:32:01.6500751Z x_sign = torch.sign(x) 2025-05-07T20:32:01.6501053Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.6501363Z x = x_sign * x_clamp 2025-05-07T20:32:01.6501617Z x0 = x[:, :D] 2025-05-07T20:32:01.6501842Z x1 = x[:, D:] 2025-05-07T20:32:01.6502051Z 2025-05-07T20:32:01.6502248Z if contiguous: 2025-05-07T20:32:01.6502486Z x0 = x0.contiguous() 2025-05-07T20:32:01.6502745Z x1 = x1.contiguous() 2025-05-07T20:32:01.6502993Z 2025-05-07T20:32:01.6503333Z if scale_ub is not None: 2025-05-07T20:32:01.6503612Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.6503956Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.6504272Z ) 2025-05-07T20:32:01.6504466Z else: 2025-05-07T20:32:01.6504688Z scale_ub_tensor = None 2025-05-07T20:32:01.6504951Z 2025-05-07T20:32:01.6505198Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.6505519Z op = silu_mul_quant 2025-05-07T20:32:01.6506080Z if compiled: 2025-05-07T20:32:01.6506569Z op = torch.compile(op) 2025-05-07T20:32:01.6506868Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.6507150Z 2025-05-07T20:32:01.6507353Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.6507519Z 2025-05-07T20:32:01.6507621Z moe/activation_test.py:117: 2025-05-07T20:32:01.6507930Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.6508275Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.6508553Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.6509246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.6509945Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.6510491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.6511302Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.6511970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.6512509Z kernel = self.compile( 2025-05-07T20:32:01.6513060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.6513717Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.6514120Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.6514358Z 2025-05-07T20:32:01.6514565Z self = 2025-05-07T20:32:01.6515652Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.6517110Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9c09800>} 2025-05-07T20:32:01.6518446Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.6519472Z context = 2025-05-07T20:32:01.6519767Z 2025-05-07T20:32:01.6519931Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.6520456Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.6520917Z module_map=module_map) 2025-05-07T20:32:01.6521286Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.6521647Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.6521904Z E ^ 2025-05-07T20:32:01.6522368Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.6522822Z 2025-05-07T20:32:01.6523234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.6523739Z 2025-05-07T20:32:01.6523961Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.6524368Z self=, 2025-05-07T20:32:01.6524773Z T=128, 2025-05-07T20:32:01.6524970Z D=5120, 2025-05-07T20:32:01.6525169Z scale_ub=None, 2025-05-07T20:32:01.6525388Z contiguous=False, 2025-05-07T20:32:01.6525620Z compiled=False, 2025-05-07T20:32:01.6525837Z ) 2025-05-07T20:32:01.6526176Z self = 2025-05-07T20:32:01.6526703Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.6526972Z 2025-05-07T20:32:01.6527060Z @given( 2025-05-07T20:32:01.6527290Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.6527732Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.6528047Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.6528374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.6528718Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.6529011Z ) 2025-05-07T20:32:01.6529363Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.6529800Z def test_silu_mul_quant( 2025-05-07T20:32:01.6530048Z self, 2025-05-07T20:32:01.6530249Z T: int, 2025-05-07T20:32:01.6530448Z D: int, 2025-05-07T20:32:01.6530672Z scale_ub: Optional[float], 2025-05-07T20:32:01.6530947Z contiguous: bool, 2025-05-07T20:32:01.6531249Z compiled: bool, 2025-05-07T20:32:01.6531480Z ) -> None: 2025-05-07T20:32:01.6531697Z torch.manual_seed(2025) 2025-05-07T20:32:01.6531937Z 2025-05-07T20:32:01.6532212Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.6532558Z 2025-05-07T20:32:01.6532751Z x_sign = torch.sign(x) 2025-05-07T20:32:01.6533074Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.6533384Z x = x_sign * x_clamp 2025-05-07T20:32:01.6533632Z x0 = x[:, :D] 2025-05-07T20:32:01.6533849Z x1 = x[:, D:] 2025-05-07T20:32:01.6534060Z 2025-05-07T20:32:01.6534249Z if contiguous: 2025-05-07T20:32:01.6534480Z x0 = x0.contiguous() 2025-05-07T20:32:01.6534748Z x1 = x1.contiguous() 2025-05-07T20:32:01.6534995Z 2025-05-07T20:32:01.6535186Z if scale_ub is not None: 2025-05-07T20:32:01.6535463Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.6535854Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.6536159Z ) 2025-05-07T20:32:01.6536358Z else: 2025-05-07T20:32:01.6536575Z scale_ub_tensor = None 2025-05-07T20:32:01.6536825Z 2025-05-07T20:32:01.6537062Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.6537381Z op = silu_mul_quant 2025-05-07T20:32:01.6537635Z if compiled: 2025-05-07T20:32:01.6537882Z op = torch.compile(op) 2025-05-07T20:32:01.6538185Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.6538467Z 2025-05-07T20:32:01.6538660Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.6538830Z 2025-05-07T20:32:01.6538933Z moe/activation_test.py:117: 2025-05-07T20:32:01.6539232Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.6539564Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.6539851Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.6540545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.6541239Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.6541770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.6542454Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.6543267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.6543799Z kernel = self.compile( 2025-05-07T20:32:01.6544341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.6544995Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.6545392Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.6545624Z 2025-05-07T20:32:01.6545830Z self = 2025-05-07T20:32:01.6546909Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.6548288Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9c0bce0>} 2025-05-07T20:32:01.6549630Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.6550653Z context = 2025-05-07T20:32:01.6550937Z 2025-05-07T20:32:01.6551184Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.6551709Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.6552176Z module_map=module_map) 2025-05-07T20:32:01.6552557Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.6552908Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.6553179Z E ^ 2025-05-07T20:32:01.6553653Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.6554101Z 2025-05-07T20:32:01.6554521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.6555031Z 2025-05-07T20:32:01.6555139Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.6555563Z self=, 2025-05-07T20:32:01.6556060Z T=128, 2025-05-07T20:32:01.6556250Z D=5120, 2025-05-07T20:32:01.6556453Z scale_ub=1200.0, 2025-05-07T20:32:01.6556682Z contiguous=True, 2025-05-07T20:32:01.6556910Z compiled=False, 2025-05-07T20:32:01.6557128Z ) 2025-05-07T20:32:01.7933584Z self = 2025-05-07T20:32:01.7934310Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.7934589Z 2025-05-07T20:32:01.7934671Z @given( 2025-05-07T20:32:01.7934929Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.7935237Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.7935544Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.7935875Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.7936195Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.7936484Z ) 2025-05-07T20:32:01.7936830Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.7937275Z def test_silu_mul_quant( 2025-05-07T20:32:01.7937520Z self, 2025-05-07T20:32:01.7937712Z T: int, 2025-05-07T20:32:01.7937901Z D: int, 2025-05-07T20:32:01.7938118Z scale_ub: Optional[float], 2025-05-07T20:32:01.7938390Z contiguous: bool, 2025-05-07T20:32:01.7938621Z compiled: bool, 2025-05-07T20:32:01.7938852Z ) -> None: 2025-05-07T20:32:01.7939066Z torch.manual_seed(2025) 2025-05-07T20:32:01.7939635Z 2025-05-07T20:32:01.7939904Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.7940247Z 2025-05-07T20:32:01.7940441Z x_sign = torch.sign(x) 2025-05-07T20:32:01.7940723Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.7941029Z x = x_sign * x_clamp 2025-05-07T20:32:01.7941268Z x0 = x[:, :D] 2025-05-07T20:32:01.7941477Z x1 = x[:, D:] 2025-05-07T20:32:01.7941689Z 2025-05-07T20:32:01.7941882Z if contiguous: 2025-05-07T20:32:01.7942105Z x0 = x0.contiguous() 2025-05-07T20:32:01.7942368Z x1 = x1.contiguous() 2025-05-07T20:32:01.7942608Z 2025-05-07T20:32:01.7942794Z if scale_ub is not None: 2025-05-07T20:32:01.7943071Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.7943406Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.7943713Z ) 2025-05-07T20:32:01.7943912Z else: 2025-05-07T20:32:01.7944132Z scale_ub_tensor = None 2025-05-07T20:32:01.7944378Z 2025-05-07T20:32:01.7944610Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.7944927Z op = silu_mul_quant 2025-05-07T20:32:01.7945182Z if compiled: 2025-05-07T20:32:01.7945428Z op = torch.compile(op) 2025-05-07T20:32:01.7945726Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.7946000Z 2025-05-07T20:32:01.7946288Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.7946455Z 2025-05-07T20:32:01.7946556Z moe/activation_test.py:117: 2025-05-07T20:32:01.7946859Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.7947191Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.7947481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.7948169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.7948871Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.7949399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.7950075Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.7950734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.7951263Z kernel = self.compile( 2025-05-07T20:32:01.7951883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.7952537Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.7952934Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.7953161Z 2025-05-07T20:32:01.7953368Z self = 2025-05-07T20:32:01.7954455Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.7955891Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9fc3920>} 2025-05-07T20:32:01.7957242Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.7958274Z context = 2025-05-07T20:32:01.7958560Z 2025-05-07T20:32:01.7958724Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.7959327Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.7959794Z module_map=module_map) 2025-05-07T20:32:01.7960154Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.7960504Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.7960763Z E ^ 2025-05-07T20:32:01.7961224Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.7961670Z 2025-05-07T20:32:01.7962088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.7962597Z 2025-05-07T20:32:01.7962701Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.7963111Z self=, 2025-05-07T20:32:01.7963507Z T=1, 2025-05-07T20:32:01.7963691Z D=7168, 2025-05-07T20:32:01.7963889Z scale_ub=1200.0, 2025-05-07T20:32:01.7964112Z contiguous=True, 2025-05-07T20:32:01.7964334Z compiled=True, 2025-05-07T20:32:01.7964542Z ) 2025-05-07T20:32:01.7964857Z self = 2025-05-07T20:32:01.7965333Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:01.7965596Z 2025-05-07T20:32:01.7965674Z @given( 2025-05-07T20:32:01.7965906Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.7966211Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.7966573Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.7966898Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.7967220Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.7967504Z ) 2025-05-07T20:32:01.7967972Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.7968417Z def test_silu_mul_quant( 2025-05-07T20:32:01.7968661Z self, 2025-05-07T20:32:01.7968865Z T: int, 2025-05-07T20:32:01.7969074Z D: int, 2025-05-07T20:32:01.7969289Z scale_ub: Optional[float], 2025-05-07T20:32:01.7969568Z contiguous: bool, 2025-05-07T20:32:01.7969813Z compiled: bool, 2025-05-07T20:32:01.7970037Z ) -> None: 2025-05-07T20:32:01.7970258Z torch.manual_seed(2025) 2025-05-07T20:32:01.7970508Z 2025-05-07T20:32:01.7970781Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.7971131Z 2025-05-07T20:32:01.7971381Z x_sign = torch.sign(x) 2025-05-07T20:32:01.7971669Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.7971990Z x = x_sign * x_clamp 2025-05-07T20:32:01.7972233Z x0 = x[:, :D] 2025-05-07T20:32:01.7972450Z x1 = x[:, D:] 2025-05-07T20:32:01.7972660Z 2025-05-07T20:32:01.7972851Z if contiguous: 2025-05-07T20:32:01.7973086Z x0 = x0.contiguous() 2025-05-07T20:32:01.7973343Z x1 = x1.contiguous() 2025-05-07T20:32:01.7973596Z 2025-05-07T20:32:01.7973796Z if scale_ub is not None: 2025-05-07T20:32:01.7974066Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.7974405Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.7974717Z ) 2025-05-07T20:32:01.7974907Z else: 2025-05-07T20:32:01.7975126Z scale_ub_tensor = None 2025-05-07T20:32:01.7975380Z 2025-05-07T20:32:01.7975612Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.7975965Z op = silu_mul_quant 2025-05-07T20:32:01.7976251Z if compiled: 2025-05-07T20:32:01.7976497Z op = torch.compile(op) 2025-05-07T20:32:01.7976797Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.7977081Z 2025-05-07T20:32:01.7977275Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.7977447Z 2025-05-07T20:32:01.7977547Z moe/activation_test.py:117: 2025-05-07T20:32:01.7977935Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.7978282Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.7978566Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.7979126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.7979689Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.7980346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.7981048Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.7981595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.7982280Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.7982937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.7983486Z kernel = self.compile( 2025-05-07T20:32:01.7984031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.7984687Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.7985081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.7985318Z 2025-05-07T20:32:01.7985523Z self = 2025-05-07T20:32:01.7986653Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.7988017Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a84950220>} 2025-05-07T20:32:01.7989354Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.7990380Z context = 2025-05-07T20:32:01.7990673Z 2025-05-07T20:32:01.7990837Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.7991359Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.7991867Z module_map=module_map) 2025-05-07T20:32:01.7992234Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.7992590Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.7992846Z E ^ 2025-05-07T20:32:01.7993313Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.7993766Z 2025-05-07T20:32:01.7994186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.7994692Z 2025-05-07T20:32:01.7994802Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.7995209Z self=, 2025-05-07T20:32:01.7995616Z T=1, 2025-05-07T20:32:01.7995807Z D=7168, 2025-05-07T20:32:01.7996001Z scale_ub=1200.0, 2025-05-07T20:32:01.7996240Z contiguous=False, 2025-05-07T20:32:01.7996466Z compiled=True, 2025-05-07T20:32:01.7996670Z ) 2025-05-07T20:32:02.0779136Z self = 2025-05-07T20:32:02.0779668Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:02.0779962Z 2025-05-07T20:32:02.0780051Z @given( 2025-05-07T20:32:02.0792452Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.0792816Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.0793725Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.0794094Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.0794434Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.0794719Z ) 2025-05-07T20:32:02.0795085Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.0795573Z def test_silu_mul_quant( 2025-05-07T20:32:02.0795850Z self, 2025-05-07T20:32:02.0796061Z T: int, 2025-05-07T20:32:02.0796269Z D: int, 2025-05-07T20:32:02.0796489Z scale_ub: Optional[float], 2025-05-07T20:32:02.0796822Z contiguous: bool, 2025-05-07T20:32:02.0797110Z compiled: bool, 2025-05-07T20:32:02.0797345Z ) -> None: 2025-05-07T20:32:02.0797573Z torch.manual_seed(2025) 2025-05-07T20:32:02.0797827Z 2025-05-07T20:32:02.0798102Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.0798458Z 2025-05-07T20:32:02.0798679Z x_sign = torch.sign(x) 2025-05-07T20:32:02.0798970Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.0799293Z x = x_sign * x_clamp 2025-05-07T20:32:02.0799549Z x0 = x[:, :D] 2025-05-07T20:32:02.0799775Z x1 = x[:, D:] 2025-05-07T20:32:02.0799986Z 2025-05-07T20:32:02.0800181Z if contiguous: 2025-05-07T20:32:02.0800427Z x0 = x0.contiguous() 2025-05-07T20:32:02.0800836Z x1 = x1.contiguous() 2025-05-07T20:32:02.0801087Z 2025-05-07T20:32:02.0801289Z if scale_ub is not None: 2025-05-07T20:32:02.0801564Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.0801920Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.0802247Z ) 2025-05-07T20:32:02.0802448Z else: 2025-05-07T20:32:02.0802674Z scale_ub_tensor = None 2025-05-07T20:32:02.0802939Z 2025-05-07T20:32:02.0803179Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.0803510Z op = silu_mul_quant 2025-05-07T20:32:02.0803771Z if compiled: 2025-05-07T20:32:02.0804023Z op = torch.compile(op) 2025-05-07T20:32:02.0804331Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.0804617Z 2025-05-07T20:32:02.0804822Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.0804995Z 2025-05-07T20:32:02.0805101Z moe/activation_test.py:117: 2025-05-07T20:32:02.0805495Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.0806114Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.0806397Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.0806975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:02.0807671Z return fn(*args, **kwargs) 
2025-05-07T20:32:02.0808345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:02.0809048Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:02.0809593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:02.0810283Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:02.0810951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:02.0811502Z     kernel = self.compile(
2025-05-07T20:32:02.0812053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:02.0812758Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:02.0813157Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:02.0813391Z
2025-05-07T20:32:02.0813736Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:02.0814831Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:02.0816346Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f3a849518a0>}
2025-05-07T20:32:02.0817751Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:02.0818778Z context = <triton._C.libtriton.ir.context object at 0x...>
2025-05-07T20:32:02.0819078Z
2025-05-07T20:32:02.0819247Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:02.0819795Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:02.0820279Z             module_map=module_map)
2025-05-07T20:32:02.0820649Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:02.0821017Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:02.0821286Z E       ^
2025-05-07T20:32:02.0821756Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:02.0822294Z
2025-05-07T20:32:02.0822713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:02.0823233Z
2025-05-07T20:32:02.0823343Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:02.0823772Z     self=<...>,
2025-05-07T20:32:02.0824175Z     T=1,
2025-05-07T20:32:02.0824372Z     D=7168,
2025-05-07T20:32:02.0824580Z     scale_ub=None,
2025-05-07T20:32:02.0824808Z     contiguous=False,
2025-05-07T20:32:02.0825049Z     compiled=True,
2025-05-07T20:32:02.0825269Z )
2025-05-07T20:32:02.1506416Z self = <...>
2025-05-07T20:32:02.1506979Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:02.1507255Z
2025-05-07T20:32:02.1507334Z @given(
2025-05-07T20:32:02.1507570Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:02.1508198Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:02.1508503Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:02.1508839Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:02.1509174Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:02.1509459Z )
2025-05-07T20:32:02.1509815Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:02.1510257Z def test_silu_mul_quant(
2025-05-07T20:32:02.1510495Z     self,
2025-05-07T20:32:02.1510703Z     T: int,
2025-05-07T20:32:02.1510903Z     D: int,
2025-05-07T20:32:02.1511115Z     scale_ub: Optional[float],
2025-05-07T20:32:02.1511391Z     contiguous: bool,
2025-05-07T20:32:02.1511630Z     compiled: bool,
2025-05-07T20:32:02.1511858Z ) -> None:
2025-05-07T20:32:02.1512069Z     torch.manual_seed(2025)
2025-05-07T20:32:02.1512315Z
2025-05-07T20:32:02.1512589Z     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:02.1512936Z
2025-05-07T20:32:02.1513131Z     x_sign = torch.sign(x)
2025-05-07T20:32:02.1513425Z     x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:02.1513732Z     x = x_sign * x_clamp
2025-05-07T20:32:02.1513975Z     x0 = x[:, :D]
2025-05-07T20:32:02.1514194Z     x1 = x[:, D:]
2025-05-07T20:32:02.1514398Z
2025-05-07T20:32:02.1514585Z     if contiguous:
2025-05-07T20:32:02.1514816Z         x0 = x0.contiguous()
2025-05-07T20:32:02.1515223Z         x1 = x1.contiguous()
2025-05-07T20:32:02.1515468Z
2025-05-07T20:32:02.1515663Z     if scale_ub is not None:
2025-05-07T20:32:02.1515933Z         scale_ub_tensor = torch.tensor(
2025-05-07T20:32:02.1516271Z             [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:02.1516590Z         )
2025-05-07T20:32:02.1516791Z     else:
2025-05-07T20:32:02.1517006Z         scale_ub_tensor = None
2025-05-07T20:32:02.1517260Z
2025-05-07T20:32:02.1517497Z     def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:02.1517807Z         op = silu_mul_quant
2025-05-07T20:32:02.1518059Z         if compiled:
2025-05-07T20:32:02.1518308Z             op = torch.compile(op)
2025-05-07T20:32:02.1518599Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:02.1518876Z
2025-05-07T20:32:02.1519067Z     y_fp8, y_scale = fn()
2025-05-07T20:32:02.1519345Z     y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:02.1519645Z
2025-05-07T20:32:02.1519881Z     def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:02.1520215Z         x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:02.1520510Z         x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:02.1520822Z         y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:02.1521182Z         return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:02.1521489Z
2025-05-07T20:32:02.1521790Z >   y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:02.1521987Z
2025-05-07T20:32:02.1522095Z moe/activation_test.py:126:
2025-05-07T20:32:02.1522387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:02.1522724Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:02.1523049Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:02.1523838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:02.1524595Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:02.1525142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:02.1525823Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:02.1526502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:02.1527275Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:02.1528114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:02.1528867Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:02.1529599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:02.1530243Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:02.1530848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:02.1531377Z     fn()
2025-05-07T20:32:02.1531886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:02.1532470Z     self.fn.run(
2025-05-07T20:32:02.1532947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:02.1533477Z     kernel = self.compile(
2025-05-07T20:32:02.1534020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:02.1534680Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:02.1535087Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:02.1535400Z
2025-05-07T20:32:02.1535611Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:02.1536705Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:02.1538103Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f3a84952e80>}
2025-05-07T20:32:02.1539451Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:02.1540471Z context = <triton._C.libtriton.ir.context object at 0x...>
2025-05-07T20:32:02.1540766Z
2025-05-07T20:32:02.1540940Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:02.1541464Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:02.1541933Z             module_map=module_map)
2025-05-07T20:32:02.1542296Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:02.1542658Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:02.1542947Z E       ^
2025-05-07T20:32:02.1543477Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:02.1543929Z
2025-05-07T20:32:02.1544351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:02.1544868Z
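The failures above and below share one root cause: Triton's fp8e4nv is the e4m3 float8 format, which NVIDIA GPUs support natively only from compute capability 8.9 (Ada/Hopper) onward. This job runs on linux.g5.4xlarge, whose A10G GPU is sm_86, so Triton exposes only fp8e5 and fp8e4b15 there and rejects any kernel that touches fp8e4nv while lowering it to TTIR. A minimal sketch of a capability guard that would let the suite skip cleanly on such runners (the helper name and skip message are illustrative, not FBGEMM's actual API):

import torch

def _supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (float8 e4m3) needs an NVIDIA GPU with compute
    # capability >= 8.9; the A10G on linux.g5.4xlarge reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical usage on the test above:
# @unittest.skipUnless(_supports_fp8e4nv(), "FP8 e4m3 kernels need SM 8.9+")
# def test_silu_mul_quant(self, ...) -> None: ...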
2025-05-07T20:32:02.2790000Z op = torch.compile(op) 2025-05-07T20:32:02.2790428Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.2790777Z 2025-05-07T20:32:02.2791131Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.2791397Z 2025-05-07T20:32:02.2791541Z moe/activation_test.py:117: 2025-05-07T20:32:02.2791939Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.2792406Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.2792819Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.2793475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:02.2794334Z return fn(*args, **kwargs) 2025-05-07T20:32:02.2795067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.2795855Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.2796582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.2797337Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.2798097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.2798822Z kernel = self.compile( 2025-05-07T20:32:02.2799474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.2800189Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.2800787Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.2801144Z 2025-05-07T20:32:02.2801392Z self = 2025-05-07T20:32:02.2802641Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.2804121Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a84953a60>} 2025-05-07T20:32:02.2805578Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.2807865Z context = 2025-05-07T20:32:02.2808215Z 2025-05-07T20:32:02.2808450Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.2809053Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.2809643Z module_map=module_map) 2025-05-07T20:32:02.2810107Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.2810566Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.2810911Z E ^ 2025-05-07T20:32:02.2811640Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.2812177Z 2025-05-07T20:32:02.2812675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.2813198Z 2025-05-07T20:32:02.2813431Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.2813893Z self=, 2025-05-07T20:32:02.2814406Z T=1, 2025-05-07T20:32:02.2814723Z D=5120, 2025-05-07T20:32:02.2815026Z scale_ub=1200.0, 2025-05-07T20:32:02.2815302Z contiguous=False, 2025-05-07T20:32:02.2815655Z compiled=False, 2025-05-07T20:32:02.2815970Z ) 2025-05-07T20:32:02.2816335Z self = 2025-05-07T20:32:02.2816999Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:02.2817292Z 2025-05-07T20:32:02.2817463Z @given( 2025-05-07T20:32:02.2817787Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.2818247Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.2818654Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.2819053Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.2819490Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.2819878Z ) 2025-05-07T20:32:02.2820373Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.2820947Z def test_silu_mul_quant( 2025-05-07T20:32:02.2821272Z self, 2025-05-07T20:32:02.2821539Z T: int, 2025-05-07T20:32:02.2821868Z D: int, 2025-05-07T20:32:02.2822171Z scale_ub: Optional[float], 2025-05-07T20:32:02.2822557Z contiguous: bool, 2025-05-07T20:32:02.2822960Z compiled: bool, 2025-05-07T20:32:02.2823236Z ) -> None: 2025-05-07T20:32:02.2823523Z torch.manual_seed(2025) 2025-05-07T20:32:02.2823927Z 2025-05-07T20:32:02.2824250Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.2824684Z 2025-05-07T20:32:02.2825025Z x_sign = torch.sign(x) 2025-05-07T20:32:02.2825368Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.2825770Z x = x_sign * x_clamp 2025-05-07T20:32:02.2826150Z x0 = x[:, :D] 2025-05-07T20:32:02.2826463Z x1 = x[:, D:] 2025-05-07T20:32:02.2826847Z 2025-05-07T20:32:02.2827178Z if contiguous: 2025-05-07T20:32:02.2827515Z x0 = x0.contiguous() 2025-05-07T20:32:02.2827811Z x1 = x1.contiguous() 2025-05-07T20:32:02.2828199Z 2025-05-07T20:32:02.2828498Z if scale_ub is not None: 2025-05-07T20:32:02.2828806Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.2829289Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.2829702Z ) 2025-05-07T20:32:02.2829940Z else: 2025-05-07T20:32:02.2830300Z scale_ub_tensor = None 2025-05-07T20:32:02.2830699Z 2025-05-07T20:32:02.2830972Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.2831438Z op = silu_mul_quant 2025-05-07T20:32:02.2831792Z if compiled: 2025-05-07T20:32:02.2832156Z op = torch.compile(op) 2025-05-07T20:32:02.2832538Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.2832901Z 2025-05-07T20:32:02.2833215Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.2833421Z 2025-05-07T20:32:02.2833569Z moe/activation_test.py:117: 2025-05-07T20:32:02.2833948Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.2834410Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.2834850Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.2835700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.2836589Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.2837221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.2837982Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.2838744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.2839381Z kernel = self.compile( 2025-05-07T20:32:02.2840003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.2840805Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.2841251Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.2841539Z 2025-05-07T20:32:02.2841792Z self = 2025-05-07T20:32:02.2843042Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.2844494Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a8420c9a0>} 2025-05-07T20:32:02.2845962Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.2847163Z context = 2025-05-07T20:32:02.2847508Z 2025-05-07T20:32:02.2847783Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.2848404Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.2849018Z module_map=module_map) 2025-05-07T20:32:02.2849456Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.2849877Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.2850281Z E ^ 2025-05-07T20:32:02.2850819Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.2851377Z 2025-05-07T20:32:02.2851805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.2852472Z 2025-05-07T20:32:02.2852605Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.2853122Z self=, 2025-05-07T20:32:02.2853561Z T=16384, 2025-05-07T20:32:02.2853923Z D=5120, 2025-05-07T20:32:02.2854201Z scale_ub=1200.0, 2025-05-07T20:32:02.2854468Z contiguous=False, 2025-05-07T20:32:02.2854861Z compiled=True, 2025-05-07T20:32:02.2855151Z ) 2025-05-07T20:32:02.3530291Z self = 2025-05-07T20:32:02.3531098Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:02.3531495Z 2025-05-07T20:32:02.3531620Z @given( 2025-05-07T20:32:02.3531941Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.3532350Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.3544070Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.3544488Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.3544844Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.3545155Z ) 2025-05-07T20:32:02.3545523Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.3545978Z def test_silu_mul_quant( 2025-05-07T20:32:02.3546241Z self, 2025-05-07T20:32:02.3546733Z T: int, 2025-05-07T20:32:02.3546942Z D: int, 2025-05-07T20:32:02.3547175Z scale_ub: Optional[float], 2025-05-07T20:32:02.3547460Z contiguous: bool, 2025-05-07T20:32:02.3547704Z compiled: bool, 2025-05-07T20:32:02.3547938Z ) -> None: 2025-05-07T20:32:02.3548160Z torch.manual_seed(2025) 2025-05-07T20:32:02.3548403Z 2025-05-07T20:32:02.3548689Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.3549044Z 2025-05-07T20:32:02.3549237Z x_sign = torch.sign(x) 2025-05-07T20:32:02.3549538Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.3549856Z x = x_sign * x_clamp 2025-05-07T20:32:02.3550098Z x0 = x[:, :D] 2025-05-07T20:32:02.3550325Z x1 = x[:, D:] 2025-05-07T20:32:02.3550545Z 2025-05-07T20:32:02.3550743Z if contiguous: 2025-05-07T20:32:02.3550977Z x0 = x0.contiguous() 2025-05-07T20:32:02.3551251Z x1 = x1.contiguous() 2025-05-07T20:32:02.3551499Z 2025-05-07T20:32:02.3551694Z if scale_ub is not None: 2025-05-07T20:32:02.3551974Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.3552318Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.3552624Z ) 2025-05-07T20:32:02.3552828Z else: 2025-05-07T20:32:02.3553050Z scale_ub_tensor = None 2025-05-07T20:32:02.3553297Z 2025-05-07T20:32:02.3553641Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.3553963Z op = silu_mul_quant 2025-05-07T20:32:02.3554215Z if compiled: 2025-05-07T20:32:02.3554471Z op = torch.compile(op) 2025-05-07T20:32:02.3554778Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.3555054Z 2025-05-07T20:32:02.3555260Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.3555434Z 2025-05-07T20:32:02.3555536Z moe/activation_test.py:117: 2025-05-07T20:32:02.3555882Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.3556239Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.3556535Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.3557103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:02.3557663Z return fn(*args, **kwargs) 
2025-05-07T20:32:02.3558333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.3559115Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.3559660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.3560343Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.3561018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.3561555Z kernel = self.compile( 2025-05-07T20:32:02.3562094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.3562757Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.3563160Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.3563390Z 2025-05-07T20:32:02.3563610Z self = 2025-05-07T20:32:02.3564688Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.3566085Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a8420de40>} 2025-05-07T20:32:02.3567605Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.3568643Z context = 2025-05-07T20:32:02.3568933Z 2025-05-07T20:32:02.3569112Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.3569634Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.3570115Z module_map=module_map) 2025-05-07T20:32:02.3570493Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.3570853Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.3571133Z E ^ 2025-05-07T20:32:02.3571606Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.3572060Z 2025-05-07T20:32:02.3572477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.3572985Z 2025-05-07T20:32:02.3573092Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.3573523Z self=, 2025-05-07T20:32:02.3573944Z T=2048, 2025-05-07T20:32:02.3574150Z D=7168, 2025-05-07T20:32:02.3574431Z scale_ub=1200.0, 2025-05-07T20:32:02.3574680Z contiguous=False, 2025-05-07T20:32:02.3574914Z compiled=True, 2025-05-07T20:32:02.3575141Z ) 2025-05-07T20:32:02.3575475Z self = 2025-05-07T20:32:02.3575974Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:02.3576259Z 2025-05-07T20:32:02.3576343Z @given( 2025-05-07T20:32:02.3576578Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.3576903Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.3577208Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.3577544Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.3577876Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.3578160Z ) 2025-05-07T20:32:02.3578511Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.3578956Z def test_silu_mul_quant( 2025-05-07T20:32:02.3579249Z self, 2025-05-07T20:32:02.3579449Z T: int, 2025-05-07T20:32:02.3579655Z D: int, 2025-05-07T20:32:02.3579870Z scale_ub: Optional[float], 2025-05-07T20:32:02.3580147Z contiguous: bool, 2025-05-07T20:32:02.3580394Z compiled: bool, 2025-05-07T20:32:02.3580624Z ) -> None: 2025-05-07T20:32:02.3580839Z torch.manual_seed(2025) 2025-05-07T20:32:02.3581086Z 2025-05-07T20:32:02.3581369Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.3581712Z 2025-05-07T20:32:02.3581912Z x_sign = torch.sign(x) 2025-05-07T20:32:02.3582212Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.3582521Z x = x_sign * x_clamp 2025-05-07T20:32:02.3582767Z x0 = x[:, :D] 2025-05-07T20:32:02.3582991Z x1 = x[:, D:] 2025-05-07T20:32:02.3583199Z 2025-05-07T20:32:02.3583394Z if contiguous: 2025-05-07T20:32:02.3583633Z x0 = x0.contiguous() 2025-05-07T20:32:02.3583894Z x1 = x1.contiguous() 2025-05-07T20:32:02.3584141Z 2025-05-07T20:32:02.3584343Z if scale_ub is not None: 2025-05-07T20:32:02.3584617Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.3584956Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.3585272Z ) 2025-05-07T20:32:02.3585476Z else: 2025-05-07T20:32:02.3585690Z scale_ub_tensor = None 2025-05-07T20:32:02.3585949Z 2025-05-07T20:32:02.3586277Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.3586594Z op = silu_mul_quant 2025-05-07T20:32:02.3586851Z if compiled: 2025-05-07T20:32:02.3587106Z op = torch.compile(op) 2025-05-07T20:32:02.3587399Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.3587680Z 2025-05-07T20:32:02.3587879Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.3588043Z 2025-05-07T20:32:02.3588150Z moe/activation_test.py:117: 2025-05-07T20:32:02.3588450Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.3588809Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.3589095Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.3589648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:02.3590209Z return fn(*args, **kwargs) 
2025-05-07T20:32:02.3590876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.3591561Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.3592099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.3592791Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.3593453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.3594065Z kernel = self.compile( 2025-05-07T20:32:02.3594609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.3595265Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.3595669Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.3595898Z 2025-05-07T20:32:02.3596111Z self = 2025-05-07T20:32:02.3597191Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.3598560Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a8420e980>} 2025-05-07T20:32:02.3599948Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.3600969Z context = 2025-05-07T20:32:02.3601262Z 2025-05-07T20:32:02.3601428Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.3601961Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.3602429Z module_map=module_map) 2025-05-07T20:32:02.3602790Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.3603149Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.3603419Z E ^ 2025-05-07T20:32:02.3603881Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.3604342Z 2025-05-07T20:32:02.3604757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.3605273Z 2025-05-07T20:32:02.4506116Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.4507415Z self=, 2025-05-07T20:32:02.4508489Z T=1, 2025-05-07T20:32:02.4508887Z D=5120, 2025-05-07T20:32:02.4509791Z scale_ub=None, 2025-05-07T20:32:02.4510225Z contiguous=False, 2025-05-07T20:32:02.4510674Z compiled=False, 2025-05-07T20:32:02.4511076Z ) 2025-05-07T20:32:02.4511698Z self = 2025-05-07T20:32:02.4512669Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:02.4513203Z 2025-05-07T20:32:02.4513355Z @given( 2025-05-07T20:32:02.4513812Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.4514445Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.4515054Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.4515709Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.4516298Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.4516642Z ) 2025-05-07T20:32:02.4517001Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.4517445Z def test_silu_mul_quant( 2025-05-07T20:32:02.4517692Z self, 2025-05-07T20:32:02.4517889Z T: int, 2025-05-07T20:32:02.4518082Z D: int, 2025-05-07T20:32:02.4518301Z scale_ub: Optional[float], 2025-05-07T20:32:02.4518572Z contiguous: bool, 2025-05-07T20:32:02.4518808Z compiled: bool, 2025-05-07T20:32:02.4519037Z ) -> None: 2025-05-07T20:32:02.4519252Z torch.manual_seed(2025) 2025-05-07T20:32:02.4519496Z 2025-05-07T20:32:02.4519878Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.4520225Z 2025-05-07T20:32:02.4520422Z x_sign = torch.sign(x) 2025-05-07T20:32:02.4520706Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.4521015Z x = x_sign * x_clamp 2025-05-07T20:32:02.4521257Z x0 = x[:, :D] 2025-05-07T20:32:02.4521468Z x1 = x[:, D:] 2025-05-07T20:32:02.4521678Z 2025-05-07T20:32:02.4521865Z if contiguous: 2025-05-07T20:32:02.4522100Z x0 = x0.contiguous() 2025-05-07T20:32:02.4522364Z x1 = x1.contiguous() 2025-05-07T20:32:02.4522609Z 2025-05-07T20:32:02.4522796Z if scale_ub is not None: 2025-05-07T20:32:02.4523068Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.4523407Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.4523711Z ) 2025-05-07T20:32:02.4523909Z else: 2025-05-07T20:32:02.4524126Z scale_ub_tensor = None 2025-05-07T20:32:02.4524470Z 2025-05-07T20:32:02.4524696Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.4525012Z op = silu_mul_quant 2025-05-07T20:32:02.4525261Z if compiled: 2025-05-07T20:32:02.4525506Z op = torch.compile(op) 2025-05-07T20:32:02.4525807Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.4526086Z 2025-05-07T20:32:02.4526275Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.4526446Z 2025-05-07T20:32:02.4526553Z moe/activation_test.py:117: 2025-05-07T20:32:02.4526853Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.4527182Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.4527463Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.4528268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.4528970Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.4529512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.4530201Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.4530869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.4531403Z kernel = self.compile( 2025-05-07T20:32:02.4532033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.4532693Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.4533093Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.4533324Z 2025-05-07T20:32:02.4533531Z self = 2025-05-07T20:32:02.4534614Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.4536042Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9740220>} 2025-05-07T20:32:02.4537417Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.4538438Z context = 2025-05-07T20:32:02.4538725Z 2025-05-07T20:32:02.4538891Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.4539417Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.4539935Z module_map=module_map) 2025-05-07T20:32:02.4540297Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.4540654Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.4540918Z E ^ 2025-05-07T20:32:02.4541386Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.4541834Z 2025-05-07T20:32:02.4542255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.4542771Z 2025-05-07T20:32:02.4542877Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.4543293Z self=, 2025-05-07T20:32:02.4543704Z T=4096, 2025-05-07T20:32:02.4543895Z D=7168, 2025-05-07T20:32:02.4544095Z scale_ub=1200.0, 2025-05-07T20:32:02.4544328Z contiguous=False, 2025-05-07T20:32:02.4544552Z compiled=False, 2025-05-07T20:32:02.4544810Z ) 2025-05-07T20:32:02.4545132Z self = 2025-05-07T20:32:02.4545624Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:02.4545907Z 2025-05-07T20:32:02.4545990Z @given( 2025-05-07T20:32:02.4546226Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.4546536Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.4546851Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.4547191Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.4547524Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.4547811Z ) 2025-05-07T20:32:02.4548164Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.4548611Z def test_silu_mul_quant( 2025-05-07T20:32:02.4548859Z self, 2025-05-07T20:32:02.4549062Z T: int, 2025-05-07T20:32:02.4549271Z D: int, 2025-05-07T20:32:02.4549493Z scale_ub: Optional[float], 2025-05-07T20:32:02.4549775Z contiguous: bool, 2025-05-07T20:32:02.4550024Z compiled: bool, 2025-05-07T20:32:02.4550247Z ) -> None: 2025-05-07T20:32:02.4550470Z torch.manual_seed(2025) 2025-05-07T20:32:02.4550722Z 2025-05-07T20:32:02.4550996Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.4551347Z 2025-05-07T20:32:02.4551551Z x_sign = torch.sign(x) 2025-05-07T20:32:02.4551925Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.4552246Z x = x_sign * x_clamp 2025-05-07T20:32:02.4552496Z x0 = x[:, :D] 2025-05-07T20:32:02.4552716Z x1 = x[:, D:] 2025-05-07T20:32:02.4552933Z 2025-05-07T20:32:02.4553127Z if contiguous: 2025-05-07T20:32:02.4553363Z x0 = x0.contiguous() 2025-05-07T20:32:02.4553621Z x1 = x1.contiguous() 2025-05-07T20:32:02.4553866Z 2025-05-07T20:32:02.4554068Z if scale_ub is not None: 2025-05-07T20:32:02.4554341Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.4554682Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.4554998Z ) 2025-05-07T20:32:02.4555195Z else: 2025-05-07T20:32:02.4555410Z scale_ub_tensor = None 2025-05-07T20:32:02.4555663Z 2025-05-07T20:32:02.4555898Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.4556215Z op = silu_mul_quant 2025-05-07T20:32:02.4556477Z if compiled: 2025-05-07T20:32:02.4556725Z op = torch.compile(op) 2025-05-07T20:32:02.4557025Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.4557309Z 2025-05-07T20:32:02.4557502Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.4557678Z 2025-05-07T20:32:02.4557780Z moe/activation_test.py:117: 2025-05-07T20:32:02.4558082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.4558524Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.4558805Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.4559499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:02.4560197Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.4560734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.4561431Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.4562094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.4562636Z kernel = self.compile( 2025-05-07T20:32:02.4563176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.4563840Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.4564291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.4564521Z 2025-05-07T20:32:02.4564735Z self = 2025-05-07T20:32:02.4565810Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.4567237Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9741440>} 2025-05-07T20:32:02.4568652Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.4569683Z context = 2025-05-07T20:32:02.4569971Z 2025-05-07T20:32:02.4570137Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.4570662Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.4571131Z module_map=module_map) 2025-05-07T20:32:02.4571501Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.4571935Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.4572207Z E ^ 2025-05-07T20:32:02.4572680Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.4573130Z 2025-05-07T20:32:02.4573549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.4574060Z 2025-05-07T20:32:02.4574166Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.4574595Z self=, 2025-05-07T20:32:02.4575007Z T=16384, 2025-05-07T20:32:02.4575205Z D=7168, 2025-05-07T20:32:02.4575405Z scale_ub=None, 2025-05-07T20:32:02.4575622Z contiguous=True, 2025-05-07T20:32:02.4575848Z compiled=True, 2025-05-07T20:32:02.4576248Z ) 2025-05-07T20:32:02.7636120Z self = 2025-05-07T20:32:02.7637124Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:02.7637445Z 2025-05-07T20:32:02.7637540Z @given( 2025-05-07T20:32:02.7637773Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.7638096Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.7638410Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.7638750Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.7639080Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.7639686Z ) 2025-05-07T20:32:02.7640046Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.7640491Z def test_silu_mul_quant( 2025-05-07T20:32:02.7640744Z self, 2025-05-07T20:32:02.7640943Z T: int, 2025-05-07T20:32:02.7641141Z D: int, 2025-05-07T20:32:02.7641367Z scale_ub: Optional[float], 2025-05-07T20:32:02.7641644Z contiguous: bool, 2025-05-07T20:32:02.7641885Z compiled: bool, 2025-05-07T20:32:02.7642126Z ) -> None: 2025-05-07T20:32:02.7642348Z torch.manual_seed(2025) 2025-05-07T20:32:02.7642595Z 2025-05-07T20:32:02.7642876Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.7643227Z 2025-05-07T20:32:02.7643450Z x_sign = torch.sign(x) 2025-05-07T20:32:02.7643747Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.7644062Z x = x_sign * x_clamp 2025-05-07T20:32:02.7644405Z x0 = x[:, :D] 2025-05-07T20:32:02.7644629Z x1 = x[:, D:] 2025-05-07T20:32:02.7644845Z 2025-05-07T20:32:02.7645040Z if contiguous: 2025-05-07T20:32:02.7645273Z x0 = x0.contiguous() 2025-05-07T20:32:02.7645539Z x1 = x1.contiguous() 2025-05-07T20:32:02.7645786Z 2025-05-07T20:32:02.7645981Z if scale_ub is not None: 2025-05-07T20:32:02.7646260Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.7646609Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.7646970Z ) 2025-05-07T20:32:02.7647178Z else: 2025-05-07T20:32:02.7647398Z scale_ub_tensor = None 2025-05-07T20:32:02.7647773Z 2025-05-07T20:32:02.7648012Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.7648418Z op = silu_mul_quant 2025-05-07T20:32:02.7648838Z if compiled: 2025-05-07T20:32:02.7649089Z op = torch.compile(op) 2025-05-07T20:32:02.7649434Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.7660111Z 2025-05-07T20:32:02.7660352Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.7660534Z 2025-05-07T20:32:02.7660653Z moe/activation_test.py:117: 2025-05-07T20:32:02.7660971Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.7661320Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.7661621Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.7662415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:02.7662999Z return fn(*args, **kwargs) 
2025-05-07T20:32:02.7663687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.7664407Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.7664968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.7665669Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.7666351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.7666903Z kernel = self.compile( 2025-05-07T20:32:02.7667461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.7668149Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.7668565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.7668800Z 2025-05-07T20:32:02.7669020Z self = 2025-05-07T20:32:02.7670129Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.7671609Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9742520>} 2025-05-07T20:32:02.7672983Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.7674038Z context = 2025-05-07T20:32:02.7674332Z 2025-05-07T20:32:02.7674511Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.7675044Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.7675528Z module_map=module_map) 2025-05-07T20:32:02.7675911Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.7676328Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.7676605Z E ^ 2025-05-07T20:32:02.7677087Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.7677543Z 2025-05-07T20:32:02.7677972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.7678484Z 2025-05-07T20:32:02.7678600Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.7679024Z self=, 2025-05-07T20:32:02.7679438Z T=4096, 2025-05-07T20:32:02.7679638Z D=5120, 2025-05-07T20:32:02.7679844Z scale_ub=None, 2025-05-07T20:32:02.7680079Z contiguous=False, 2025-05-07T20:32:02.7680310Z compiled=True, 2025-05-07T20:32:02.7680533Z ) 2025-05-07T20:32:02.7680863Z self = 2025-05-07T20:32:02.7681376Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:02.7681652Z 2025-05-07T20:32:02.7681737Z @given( 2025-05-07T20:32:02.7681989Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.7682317Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.7682630Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.7682977Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.7683404Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.7683699Z ) 2025-05-07T20:32:02.7684061Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.7684516Z def test_silu_mul_quant( 2025-05-07T20:32:02.7684768Z self, 2025-05-07T20:32:02.7684986Z T: int, 2025-05-07T20:32:02.7685201Z D: int, 2025-05-07T20:32:02.7685436Z scale_ub: Optional[float], 2025-05-07T20:32:02.7685718Z contiguous: bool, 2025-05-07T20:32:02.7685978Z compiled: bool, 2025-05-07T20:32:02.7686218Z ) -> None: 2025-05-07T20:32:02.7686446Z torch.manual_seed(2025) 2025-05-07T20:32:02.7686745Z 2025-05-07T20:32:02.7687039Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.7687388Z 2025-05-07T20:32:02.7687672Z x_sign = torch.sign(x) 2025-05-07T20:32:02.7687979Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.7688298Z x = x_sign * x_clamp 2025-05-07T20:32:02.7688561Z x0 = x[:, :D] 2025-05-07T20:32:02.7688796Z x1 = x[:, D:] 2025-05-07T20:32:02.7689010Z 2025-05-07T20:32:02.7689215Z if contiguous: 2025-05-07T20:32:02.7689463Z x0 = x0.contiguous() 2025-05-07T20:32:02.7689728Z x1 = x1.contiguous() 2025-05-07T20:32:02.7689984Z 2025-05-07T20:32:02.7690195Z if scale_ub is not None: 2025-05-07T20:32:02.7690478Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.7690878Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.7691200Z ) 2025-05-07T20:32:02.7691409Z else: 2025-05-07T20:32:02.7691632Z scale_ub_tensor = None 2025-05-07T20:32:02.7691897Z 2025-05-07T20:32:02.7692139Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.7692461Z op = silu_mul_quant 2025-05-07T20:32:02.7692720Z if compiled: 2025-05-07T20:32:02.7692980Z op = torch.compile(op) 2025-05-07T20:32:02.7693285Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.7693572Z 2025-05-07T20:32:02.7693776Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.7693944Z 2025-05-07T20:32:02.7694054Z moe/activation_test.py:117: 2025-05-07T20:32:02.7694352Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.7694697Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.7694992Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.7695603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:02.7696172Z return fn(*args, **kwargs) 
2025-05-07T20:32:02.7696888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.7697582Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.7698128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.7698819Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.7699489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.7700027Z kernel = self.compile( 2025-05-07T20:32:02.7700577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.7701245Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.7701652Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.7701885Z 2025-05-07T20:32:02.7702093Z self = 2025-05-07T20:32:02.7703256Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.7704639Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9742c00>} 2025-05-07T20:32:02.7706341Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.7707376Z context = 2025-05-07T20:32:02.7707667Z 2025-05-07T20:32:02.7707837Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.7708369Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.7708844Z module_map=module_map) 2025-05-07T20:32:02.7709217Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.7709586Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.7709857Z E ^ 2025-05-07T20:32:02.7710333Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.7710784Z 2025-05-07T20:32:02.7711201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.7711801Z 2025-05-07T20:32:02.8850581Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.8851248Z self=, 2025-05-07T20:32:02.8851810Z T=4096, 2025-05-07T20:32:02.8852071Z D=5120, 2025-05-07T20:32:02.8852344Z scale_ub=1200.0, 2025-05-07T20:32:02.8852605Z contiguous=False, 2025-05-07T20:32:02.8852830Z compiled=False, 2025-05-07T20:32:02.8853048Z ) 2025-05-07T20:32:02.8853370Z self = 2025-05-07T20:32:02.8853891Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:02.8854179Z 2025-05-07T20:32:02.8854262Z @given( 2025-05-07T20:32:02.8854497Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.8854815Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.8855122Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.8855456Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.8856143Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.8856435Z ) 2025-05-07T20:32:02.8856786Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.8857234Z def test_silu_mul_quant( 2025-05-07T20:32:02.8857478Z self, 2025-05-07T20:32:02.8857686Z T: int, 2025-05-07T20:32:02.8857891Z D: int, 2025-05-07T20:32:02.8858109Z scale_ub: Optional[float], 2025-05-07T20:32:02.8858392Z contiguous: bool, 2025-05-07T20:32:02.8858639Z compiled: bool, 2025-05-07T20:32:02.8858869Z ) -> None: 2025-05-07T20:32:02.8859083Z torch.manual_seed(2025) 2025-05-07T20:32:02.8859332Z 2025-05-07T20:32:02.8859613Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.8859960Z 2025-05-07T20:32:02.8860159Z x_sign = torch.sign(x) 2025-05-07T20:32:02.8860455Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.8860772Z x = x_sign * x_clamp 2025-05-07T20:32:02.8861018Z x0 = x[:, :D] 2025-05-07T20:32:02.8861245Z x1 = x[:, D:] 2025-05-07T20:32:02.8861458Z 2025-05-07T20:32:02.8861652Z if contiguous: 2025-05-07T20:32:02.8861898Z x0 = x0.contiguous() 2025-05-07T20:32:02.8862154Z x1 = x1.contiguous() 2025-05-07T20:32:02.8862401Z 2025-05-07T20:32:02.8862603Z if scale_ub is not None: 2025-05-07T20:32:02.8862876Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.8863361Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.8863680Z ) 2025-05-07T20:32:02.8863888Z else: 2025-05-07T20:32:02.8864103Z scale_ub_tensor = None 2025-05-07T20:32:02.8864361Z 2025-05-07T20:32:02.8864605Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.8864921Z op = silu_mul_quant 2025-05-07T20:32:02.8865206Z if compiled: 2025-05-07T20:32:02.8865474Z op = torch.compile(op) 2025-05-07T20:32:02.8865786Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.8866062Z 2025-05-07T20:32:02.8866262Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.8866426Z 2025-05-07T20:32:02.8866537Z moe/activation_test.py:117: 2025-05-07T20:32:02.8866858Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.8867192Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.8867489Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.8868183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:02.8868875Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.8869417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.8870105Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.8870877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.8871407Z kernel = self.compile( 2025-05-07T20:32:02.8871956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.8872616Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.8873018Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.8873254Z 2025-05-07T20:32:02.8873461Z self = 2025-05-07T20:32:02.8874543Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.8875936Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a92d8400>} 2025-05-07T20:32:02.8877330Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.8878351Z context = 2025-05-07T20:32:02.8878645Z 2025-05-07T20:32:02.8878817Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.8879341Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.8879811Z module_map=module_map) 2025-05-07T20:32:02.8880174Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.8880627Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.8880933Z E ^ 2025-05-07T20:32:02.8881401Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:02.8882902Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self = <...>, T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
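Every draw fails at the same point: the Triton frontend rejects the fp8e4nv (FP8 e4m3) type while lowering _fbgemm_silu_mul_quant, before any kernel runs. Triton's CUDA backend emits fp8e4nv only on compute capability 8.9 and newer (Ada/Hopper); on older parts such as the A10G (SM 8.6) found in AWS g5 runners it exposes only fp8e4b15 and fp8e5, which is exactly what the ValueError lists. A minimal sketch of a capability gate that would skip these cases instead of failing them; the helper name and its placement are illustrative, not FBGEMM API:

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv corresponds to torch.float8_e4m3fn; Triton emits it natively
        # only on NVIDIA compute capability >= (8, 9), i.e. Ada and Hopper.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Usage sketch on the test above (unittest-style, matching the test class):
    #
    #     @unittest.skipUnless(
    #         supports_fp8e4nv(), "fp8e4nv needs SM 8.9+; this GPU is older"
    #     )
    #     def test_silu_mul_quant(self, ...) -> None: ...

With a gate like this, runs on SM 8.6 hardware would record skips rather than repeating the identical CompilationError for every Hypothesis example.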
The remaining Hypothesis draws hit the identical traceback and CompilationError; only the drawn parameters differ:

2025-05-07T20:32:02.9799600Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -- same CompilationError
2025-05-07T20:32:02.9832218Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -- same CompilationError
2025-05-07T20:32:02.9873833Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -- same CompilationError
2025-05-07T20:32:03.3529357Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) -- same CompilationError
2025-05-07T20:32:03.3560662Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -- same CompilationError
2025-05-07T20:32:03.4286368Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) -- same CompilationError
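The full test body and traceback are re-printed for each draw because the test runs with verbosity=Verbosity.verbose in its @settings; at that level Hypothesis echoes every "Trying example". A sketch of a quieter configuration for CI, assuming nothing about the module beyond the hypothesis import; a failure still ends with the minimal failing example:

    from hypothesis import Verbosity, settings

    # Register and load a CI profile once (e.g. in conftest.py). Verbosity.normal
    # drops the per-example "Trying example: ..." echo seen throughout this log.
    settings.register_profile("ci", verbosity=Verbosity.normal, deadline=None)
    settings.load_profile("ci")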
2025-05-07T20:32:03.5586454Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -- same CompilationError
2025-05-07T20:32:03.5624067Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -- same CompilationError
2025-05-07T20:32:03.6971485Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True) -- same CompilationError
2025-05-07T20:32:03.7004693Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True) -- same CompilationError
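For what the failing op computes: judging only from the test body above (bf16 inputs x0 and x1, an optional float32 scale_ub tensor, and a (y_fp8, y_scale) pair out), silu_mul_quant fuses y = silu(x0) * x1 with FP8 quantization. A rough eager-mode reference under those assumptions; the row-wise scaling granularity and the scale_ub-as-absmax-cap semantics are guesses, not FBGEMM's implementation:

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Fused activation, written out: y = silu(x0) * x1, in fp32 for accuracy.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Row-wise absmax scaling into the fp8 e4m3 range (assumed granularity).
        amax = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
        if scale_ub is not None:
            # Assumed semantics: scale_ub caps the absmax used for the scale.
            amax = torch.minimum(amax, scale_ub.float())
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        y_scale = amax / fp8_max
        y_fp8 = (y / y_scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

The .to(torch.float8_e4m3fn) cast here is the eager analogue of the fp8e4nv conversion the Triton kernel attempts, which is the exact step rejected on this GPU.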
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:03.9566828Z
2025-05-07T20:32:03.9567251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

[Between 20:32:03.95 and 20:32:04.68 Hypothesis tried nine further examples; each one raised the identical CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"), with a test listing and traceback identical to the one above apart from object addresses. Only the sampled parameters differ:

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)]
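[Editorial sketch, not part of the captured log: every CompilationError above shares one root cause — the kernel asks Triton for the fp8e4nv (FP8 E4M3) dtype on a GPU that only exposes ('fp8e4b15', 'fp8e5'). The compute-capability threshold below is an assumption inferred from the error text (Triton exposes fp8e4nv on Ada/Hopper-class GPUs, i.e. sm_89+, while the g5.4xlarge runner's A10G reports sm_86); it is not something this log states. A guard of roughly this shape would let the suite skip instead of fail:

import unittest

import torch


def fp8e4nv_supported() -> bool:
    """Best-effort check for Triton fp8e4nv (FP8 E4M3) support."""
    if not torch.cuda.is_available():
        return False
    # get_device_capability() returns (major, minor), e.g. (8, 6) on A10G.
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on a test class like the one in moe/activation_test.py:
@unittest.skipIf(
    not fp8e4nv_supported(),
    "fp8e4nv requires sm_89+; this GPU only supports ('fp8e4b15', 'fp8e5')",
)
class SiluMulQuantTests(unittest.TestCase):
    pass
]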
[At 20:32:04.74 the failure mode changes from compile-time errors to CUDA OOM; by this point the process already holds ~21.9 GiB of the device's 22.07 GiB. First OOM example (the @given test listing repeats verbatim and is omitted):

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Three more examples fail the same way, differing only in the failing line and the requested allocation:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at x_clamp (moe/activation_test.py:95), tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at torch.randn (moe/activation_test.py:92), tried to allocate 448.00 MiB
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at x_clamp (moe/activation_test.py:95), tried to allocate 56.00 MiB]
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.7533143Z 2025-05-07T20:32:04.7533261Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:04.7533569Z 2025-05-07T20:32:04.7533676Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.7534092Z self=, 2025-05-07T20:32:04.7534499Z T=2048, 2025-05-07T20:32:04.7534689Z D=7168, 2025-05-07T20:32:04.7534888Z scale_ub=None, 2025-05-07T20:32:04.7535104Z contiguous=True, 2025-05-07T20:32:04.7535332Z compiled=False, 2025-05-07T20:32:04.7535543Z ) 2025-05-07T20:32:04.8401625Z self = 2025-05-07T20:32:04.8402889Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:04.8403641Z 2025-05-07T20:32:04.8403855Z @given( 2025-05-07T20:32:04.8404480Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.8405272Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.8406170Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.8406644Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.8406982Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.8407267Z ) 2025-05-07T20:32:04.8407675Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.8408124Z def test_silu_mul_quant( 2025-05-07T20:32:04.8408374Z self, 2025-05-07T20:32:04.8408568Z T: int, 2025-05-07T20:32:04.8408773Z D: int, 2025-05-07T20:32:04.8408995Z scale_ub: Optional[float], 2025-05-07T20:32:04.8409510Z contiguous: bool, 2025-05-07T20:32:04.8409752Z compiled: bool, 2025-05-07T20:32:04.8409982Z ) -> None: 2025-05-07T20:32:04.8410200Z torch.manual_seed(2025) 2025-05-07T20:32:04.8410449Z 2025-05-07T20:32:04.8410725Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.8411067Z 2025-05-07T20:32:04.8411271Z > x_sign = torch.sign(x) 2025-05-07T20:32:04.8413216Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
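[Note: the hint repeated in every OOM message can be acted on before any CUDA allocation happens in the process. A minimal sketch, assuming module import time in the test file is the chosen place; setting the variable in the workflow's job environment would work equally well:]

    import os

    # The CUDA caching allocator reads this on its first allocation, so it
    # must be set before any tensor lands on the GPU -- i.e. before torch
    # initializes CUDA in this process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # noqa: E402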
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = ...
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': ..., 'min_dot_size': ...}
module_map = {'triton.language.extra.libdevice': ...}
context = ...

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
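[Note: these CompilationError failures are an architecture mismatch, not a flaky build. The kernel requests Triton's fp8e4nv (FP8 E4M3) type, which, as the error says, this GPU does not provide; the A10G backing linux.g5.4xlarge.nvidia.gpu is sm_86, where only fp8e4b15 and fp8e5 are available. A sketch of a capability guard that would turn the hard error into a skip; the helper name and the 8.9 threshold (Ada/Hopper-class support for fp8e4nv) are assumptions, not something this test file defines:]

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Assumed gate: Triton exposes fp8e4nv on compute capability >= 8.9;
        # the sm_86 A10G on this runner reports (8, 6) and fails to compile.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Applied to test_silu_mul_quant above, this skips instead of erroring:
    requires_fp8 = unittest.skipIf(
        not _supports_fp8e4nv(), "fp8e4nv not supported on this GPU architecture"
    )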
[The remaining fp8e4nv failures repeat the traceback above verbatim; only the sampled parameters differ.]

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:92: OutOfMemoryError
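[Note: the "Tried to allocate" sizes match the test's input tensor exactly -- T x 2D bfloat16 elements at 2 bytes apiece -- so each request size is a deterministic function of the sampled (T, D). A quick check against the figures reported in this log:]

    def randn_alloc_mib(T: int, D: int) -> float:
        # x = torch.randn([T, 2 * D], dtype=torch.bfloat16): 2 bytes/element
        return T * (2 * D) * 2 / 2**20

    assert randn_alloc_mib(2048, 7168) == 56.0    # the 56.00 MiB requests
    assert randn_alloc_mib(4096, 7168) == 112.0   # the 112.00 MiB requests
    assert randn_alloc_mib(16384, 7168) == 448.0  # the 448.00 MiB requests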
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:94: OutOfMemoryError
[The next ten examples all fail with OutOfMemoryError on the initial torch.randn allocation (moe/activation_test.py:92) and report an identical pool state: 26.44 MiB free of 22.07 GiB total, 22.04 GiB in use by this process, 21.73 GiB allocated by PyTorch, 19.12 MiB reserved but unallocated.]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)    -> Tried to allocate 320.00 MiB
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)     -> Tried to allocate 80.00 MiB
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)    -> Tried to allocate 40.00 MiB
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)      -> Tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)  -> Tried to allocate 40.00 MiB
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)   -> Tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)    -> Tried to allocate 448.00 MiB
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)     -> Tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)    -> Tried to allocate 448.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)  -> Tried to allocate 448.00 MiB
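[Note: the "allocated by PyTorch" figure creeps from 21.50 GiB up to 21.74 GiB over the run because Hypothesis executes every example inside a single test invocation; unittest's setUp/tearDown wrap the whole batch, so nothing reclaims memory between examples. A sketch of one mitigation -- the helper name is hypothetical, the calls are standard CPython/PyTorch APIs -- meant to be invoked at the top of the test body, right before torch.manual_seed(2025):]

    import gc

    import torch

    def _reclaim_cuda_memory() -> None:
        # tearDown() runs once around ALL Hypothesis examples, not per
        # example, so the test body itself must drop dead tensors and
        # return cached blocks before allocating the next example's input.
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()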
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.1639885Z 2025-05-07T20:32:05.1640010Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.1640274Z 2025-05-07T20:32:05.1640378Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.1640791Z self=, 2025-05-07T20:32:05.1641200Z T=128, 2025-05-07T20:32:05.1641396Z D=5120, 2025-05-07T20:32:05.1641588Z scale_ub=1200.0, 2025-05-07T20:32:05.1641820Z contiguous=False, 2025-05-07T20:32:05.1642051Z compiled=False, 2025-05-07T20:32:05.1642253Z ) 2025-05-07T20:32:05.2665080Z self = 2025-05-07T20:32:05.2665682Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:05.2665965Z 2025-05-07T20:32:05.2666053Z @given( 2025-05-07T20:32:05.2666292Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.2666614Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.2666927Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.2667277Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.2667613Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.2667905Z ) 2025-05-07T20:32:05.2668264Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.2668709Z def test_silu_mul_quant( 2025-05-07T20:32:05.2668961Z self, 2025-05-07T20:32:05.2669165Z T: int, 2025-05-07T20:32:05.2669361Z D: int, 2025-05-07T20:32:05.2669844Z scale_ub: Optional[float], 2025-05-07T20:32:05.2670130Z contiguous: bool, 2025-05-07T20:32:05.2670372Z compiled: bool, 2025-05-07T20:32:05.2670608Z ) -> None: 2025-05-07T20:32:05.2670828Z torch.manual_seed(2025) 2025-05-07T20:32:05.2671075Z 2025-05-07T20:32:05.2671363Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.2671709Z 2025-05-07T20:32:05.2671903Z x_sign = torch.sign(x) 2025-05-07T20:32:05.2672208Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.2672532Z x = x_sign * x_clamp 2025-05-07T20:32:05.2672775Z x0 = x[:, :D] 2025-05-07T20:32:05.2673005Z x1 = x[:, D:] 2025-05-07T20:32:05.2673221Z 2025-05-07T20:32:05.2673409Z if contiguous: 2025-05-07T20:32:05.2673651Z x0 = x0.contiguous() 2025-05-07T20:32:05.2673924Z x1 = x1.contiguous() 2025-05-07T20:32:05.2674176Z 2025-05-07T20:32:05.2674372Z if scale_ub is not None: 2025-05-07T20:32:05.2674663Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.2675008Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.2675319Z ) 2025-05-07T20:32:05.2675524Z else: 2025-05-07T20:32:05.2675744Z scale_ub_tensor = None 2025-05-07T20:32:05.2675997Z 2025-05-07T20:32:05.2676239Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.2676585Z op = silu_mul_quant 2025-05-07T20:32:05.2676948Z if compiled: 2025-05-07T20:32:05.2677208Z op = torch.compile(op) 2025-05-07T20:32:05.2677514Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.2677790Z 2025-05-07T20:32:05.2677997Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.2678163Z 2025-05-07T20:32:05.2678275Z moe/activation_test.py:117: 2025-05-07T20:32:05.2678582Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.2678915Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.2679215Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.2679922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.2680629Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.2681169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.2681861Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.2682619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.2683163Z kernel = self.compile( 2025-05-07T20:32:05.2683706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.2684374Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.2684784Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.2685016Z 2025-05-07T20:32:05.2685228Z self = 2025-05-07T20:32:05.2686324Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.2687820Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a89237e0>} 2025-05-07T20:32:05.2689177Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.2690293Z context = 2025-05-07T20:32:05.2690587Z 2025-05-07T20:32:05.2690756Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.2691287Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.2691758Z module_map=module_map) 2025-05-07T20:32:05.2692122Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.2692483Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.2692756Z E ^ 2025-05-07T20:32:05.2693226Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.2693676Z 2025-05-07T20:32:05.2694095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.2694614Z 2025-05-07T20:32:05.2694721Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.2695146Z self=, 2025-05-07T20:32:05.2695558Z T=2048, 2025-05-07T20:32:05.2695745Z D=7168, 2025-05-07T20:32:05.2695946Z scale_ub=None, 2025-05-07T20:32:05.2696168Z contiguous=False, 2025-05-07T20:32:05.2696396Z compiled=False, 2025-05-07T20:32:05.2696611Z ) 2025-05-07T20:32:05.2696971Z self = 2025-05-07T20:32:05.2697477Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:05.2697808Z 2025-05-07T20:32:05.2697887Z @given( 2025-05-07T20:32:05.2698126Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.2698441Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.2698752Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.2699084Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.2699420Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.2699706Z ) 2025-05-07T20:32:05.2700063Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.2700511Z def test_silu_mul_quant( 2025-05-07T20:32:05.2700756Z self, 2025-05-07T20:32:05.2700959Z T: int, 2025-05-07T20:32:05.2701165Z D: int, 2025-05-07T20:32:05.2701382Z scale_ub: Optional[float], 2025-05-07T20:32:05.2701661Z contiguous: bool, 2025-05-07T20:32:05.2701908Z compiled: bool, 2025-05-07T20:32:05.2702187Z ) -> None: 2025-05-07T20:32:05.2702417Z torch.manual_seed(2025) 2025-05-07T20:32:05.2702666Z 2025-05-07T20:32:05.2702942Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.2705014Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
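Both failure modes are now visible above: the CUDA OOM and Triton's fp8e4nv CompilationError. The latter is consistent with this runner's A10G GPU; as an assumption (not stated in the log), Triton only lowers fp8e4nv (e4m3) on compute capability 8.9 or newer, while the A10G reports 8.6 and thus offers only fp8e4b15 and fp8e5. A hedged sketch of a guard that would skip these cases on unsupported hardware (the helper name is hypothetical):

    import unittest
    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Assumption: e4m3 ("fp8e4nv") needs an sm_89+ (Ada/Hopper-class) GPU.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the test shown in this log:
    #   @unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv not supported on this GPU")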
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.2707319Z 2025-05-07T20:32:05.2707447Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.2707668Z 2025-05-07T20:32:05.2707773Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.2708196Z self=, 2025-05-07T20:32:05.2708596Z T=128, 2025-05-07T20:32:05.2708792Z D=7168, 2025-05-07T20:32:05.2708990Z scale_ub=1200.0, 2025-05-07T20:32:05.2709212Z contiguous=True, 2025-05-07T20:32:05.2709444Z compiled=True, 2025-05-07T20:32:05.2709654Z ) 2025-05-07T20:32:05.3016489Z self = 2025-05-07T20:32:05.3017229Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.3017541Z 2025-05-07T20:32:05.3017663Z @given( 2025-05-07T20:32:05.3017997Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.3018375Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.3018686Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.3019019Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.3019345Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.3019682Z ) 2025-05-07T20:32:05.3020031Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.3020474Z def test_silu_mul_quant( 2025-05-07T20:32:05.3020719Z self, 2025-05-07T20:32:05.3020911Z T: int, 2025-05-07T20:32:05.3021114Z D: int, 2025-05-07T20:32:05.3021335Z scale_ub: Optional[float], 2025-05-07T20:32:05.3021601Z contiguous: bool, 2025-05-07T20:32:05.3021844Z compiled: bool, 2025-05-07T20:32:05.3022081Z ) -> None: 2025-05-07T20:32:05.3022303Z torch.manual_seed(2025) 2025-05-07T20:32:05.3022547Z 2025-05-07T20:32:05.3022820Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.3023161Z 2025-05-07T20:32:05.3023354Z x_sign = torch.sign(x) 2025-05-07T20:32:05.3023645Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.3023958Z x = x_sign * x_clamp 2025-05-07T20:32:05.3024285Z x0 = x[:, :D] 2025-05-07T20:32:05.3024507Z x1 = x[:, D:] 2025-05-07T20:32:05.3024718Z 2025-05-07T20:32:05.3024905Z if contiguous: 2025-05-07T20:32:05.3025142Z x0 = x0.contiguous() 2025-05-07T20:32:05.3025403Z x1 = x1.contiguous() 2025-05-07T20:32:05.3025640Z 2025-05-07T20:32:05.3025837Z if scale_ub is not None: 2025-05-07T20:32:05.3026116Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.3026453Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.3026767Z ) 2025-05-07T20:32:05.3026970Z else: 2025-05-07T20:32:05.3027183Z scale_ub_tensor = None 2025-05-07T20:32:05.3027441Z 2025-05-07T20:32:05.3027681Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.3027999Z op = silu_mul_quant 2025-05-07T20:32:05.3028252Z if compiled: 2025-05-07T20:32:05.3028506Z op = torch.compile(op) 2025-05-07T20:32:05.3028880Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.3029161Z 2025-05-07T20:32:05.3029365Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.3029532Z 2025-05-07T20:32:05.3029639Z moe/activation_test.py:117: 2025-05-07T20:32:05.3029933Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.3030273Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.3030556Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.3031126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.3031686Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.3032350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.3033042Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.3033577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.3034264Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.3034931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.3035470Z kernel = self.compile( 2025-05-07T20:32:05.3036008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.3036747Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.3037150Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.3037381Z 2025-05-07T20:32:05.3037589Z self = 2025-05-07T20:32:05.3038670Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.3040047Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a8786a20>} 2025-05-07T20:32:05.3041389Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.3042417Z context = 2025-05-07T20:32:05.3042704Z 2025-05-07T20:32:05.3042871Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.3043395Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.3043863Z module_map=module_map) 2025-05-07T20:32:05.3044230Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.3044657Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.3044922Z E ^ 2025-05-07T20:32:05.3045388Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.3045836Z 2025-05-07T20:32:05.3046252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.3046765Z 2025-05-07T20:32:05.3046876Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.3047291Z self=, 2025-05-07T20:32:05.3047785Z T=128, 2025-05-07T20:32:05.3047971Z D=7168, 2025-05-07T20:32:05.3048178Z scale_ub=1200.0, 2025-05-07T20:32:05.3048406Z contiguous=True, 2025-05-07T20:32:05.3048630Z compiled=False, 2025-05-07T20:32:05.3048841Z ) 2025-05-07T20:32:05.3049163Z self = 2025-05-07T20:32:05.3049704Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.3049984Z 2025-05-07T20:32:05.3050064Z @given( 2025-05-07T20:32:05.3050306Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.3050624Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.3050928Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.3051261Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.3051596Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.3051885Z ) 2025-05-07T20:32:05.3052237Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.3052681Z def test_silu_mul_quant( 2025-05-07T20:32:05.3052922Z self, 2025-05-07T20:32:05.3053124Z T: int, 2025-05-07T20:32:05.3053331Z D: int, 2025-05-07T20:32:05.3053549Z scale_ub: Optional[float], 2025-05-07T20:32:05.3053826Z contiguous: bool, 2025-05-07T20:32:05.3054071Z compiled: bool, 2025-05-07T20:32:05.3054291Z ) -> None: 2025-05-07T20:32:05.3054515Z torch.manual_seed(2025) 2025-05-07T20:32:05.3054760Z 2025-05-07T20:32:05.3055030Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.3055377Z 2025-05-07T20:32:05.3055576Z x_sign = torch.sign(x) 2025-05-07T20:32:05.3055870Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.3057949Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
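Every OOM message in this run suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A minimal sketch of applying that suggestion; the only assumption is that the variable must be set before the first CUDA allocation (e.g. in the CI job environment or at the top of the test entry point):

    import os
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported afterwards so the caching allocator reads the setting

Whether this would rescue this particular run is unclear: the messages show 21.7+ GiB actually allocated by PyTorch and only a few MiB reserved-but-unallocated, so the pressure here is real allocation, not fragmentation.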
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.3059806Z 2025-05-07T20:32:05.3059928Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:05.3060145Z 2025-05-07T20:32:05.3060250Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.3060667Z self=, 2025-05-07T20:32:05.3061070Z T=128, 2025-05-07T20:32:05.3061260Z D=5120, 2025-05-07T20:32:05.3061458Z scale_ub=1200.0, 2025-05-07T20:32:05.3061686Z contiguous=True, 2025-05-07T20:32:05.3061909Z compiled=True, 2025-05-07T20:32:05.3062121Z ) 2025-05-07T20:32:05.3062442Z self = 2025-05-07T20:32:05.3062923Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.3063194Z 2025-05-07T20:32:05.3063276Z @given( 2025-05-07T20:32:05.3063510Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.3063819Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.3065526Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.3065860Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.3066186Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.3066481Z ) 2025-05-07T20:32:05.3066843Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.3067288Z def test_silu_mul_quant( 2025-05-07T20:32:05.3067530Z self, 2025-05-07T20:32:05.3067739Z T: int, 2025-05-07T20:32:05.3067946Z D: int, 2025-05-07T20:32:05.3068162Z scale_ub: Optional[float], 2025-05-07T20:32:05.3068440Z contiguous: bool, 2025-05-07T20:32:05.3068685Z compiled: bool, 2025-05-07T20:32:05.3068907Z ) -> None: 2025-05-07T20:32:05.3069129Z torch.manual_seed(2025) 2025-05-07T20:32:05.3078275Z 2025-05-07T20:32:05.3078574Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.3079016Z 2025-05-07T20:32:05.3079225Z x_sign = torch.sign(x) 2025-05-07T20:32:05.3079527Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.3081538Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
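By this point only 4.44 MiB is free, so even 20 MiB requests fail: each retried Hypothesis example inherits the memory pressure left behind by earlier ones. A sketch of a per-example cleanup that could run between examples (an assumption about a remedy, not something the log confirms):

    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()               # drop dead Python references from the last example
        torch.cuda.empty_cache()   # return cached, unused blocks to the driver
        torch.cuda.synchronize()   # wait for pending frees to complete

Called from the test's tearDown, this would keep one example's tensors from starving the next.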
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.3083393Z 2025-05-07T20:32:05.3083518Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:05.3083744Z 2025-05-07T20:32:05.3083852Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.3084278Z self=, 2025-05-07T20:32:05.3084689Z T=128, 2025-05-07T20:32:05.3084891Z D=7168, 2025-05-07T20:32:05.3085095Z scale_ub=None, 2025-05-07T20:32:05.3085314Z contiguous=True, 2025-05-07T20:32:05.3085550Z compiled=True, 2025-05-07T20:32:05.3085767Z ) 2025-05-07T20:32:05.7406934Z self = 2025-05-07T20:32:05.7407630Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.7408073Z 2025-05-07T20:32:05.7408168Z @given( 2025-05-07T20:32:05.7408400Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.7408715Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.7409018Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.7409345Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.7409682Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.7409981Z ) 2025-05-07T20:32:05.7410341Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.7410783Z def test_silu_mul_quant( 2025-05-07T20:32:05.7411038Z self, 2025-05-07T20:32:05.7411250Z T: int, 2025-05-07T20:32:05.7411453Z D: int, 2025-05-07T20:32:05.7411682Z scale_ub: Optional[float], 2025-05-07T20:32:05.7411962Z contiguous: bool, 2025-05-07T20:32:05.7412204Z compiled: bool, 2025-05-07T20:32:05.7412442Z ) -> None: 2025-05-07T20:32:05.7412678Z torch.manual_seed(2025) 2025-05-07T20:32:05.7412923Z 2025-05-07T20:32:05.7413204Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.7415262Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
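For context, the test body repeated throughout this log compares silu_mul_quant against a ref_fn that computes y = x0 * sigmoid(x0) * x1 in fp32 and then quantizes row-wise with triton_quantize_fp8_row (see the tracebacks in the failure summary below). A pure-PyTorch sketch of that row-wise quantization, under the assumption that the scale is each row's max magnitude (optionally clamped to scale_ub) divided by the fp8 e4m3 maximum:

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

    def rowwise_quantize_fp8(y, scale_ub=None):
        row_max = y.abs().amax(dim=1)                   # per-row max magnitude
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # apply the upper bound
        scale = row_max.clamp(min=1e-12) / FP8_MAX      # avoid divide-by-zero
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Dequantizing as y_fp8.to(torch.float32) * scale[:, None] then mirrors the comparison done in the test.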
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.7417256Z 2025-05-07T20:32:05.7417380Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.7417595Z 2025-05-07T20:32:05.7449834Z FAILED 2025-05-07T20:32:05.7450011Z 2025-05-07T20:32:05.7450201Z =================================== FAILURES =================================== 2025-05-07T20:32:05.7450668Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:05.7451223Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:05.7451975Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 57, in testPartExecutor 2025-05-07T20:32:05.7452686Z | yield 2025-05-07T20:32:05.7453149Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 623, in run 2025-05-07T20:32:05.7453902Z | self._callTestMethod(testMethod) 2025-05-07T20:32:05.7454581Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 579, in _callTestMethod 2025-05-07T20:32:05.7455144Z | if method() is not None: 2025-05-07T20:32:05.7455407Z | ^^^^^^^^ 2025-05-07T20:32:05.7456090Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:05.7456815Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.7457125Z | ^^^^^^^ 2025-05-07T20:32:05.7457705Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:05.7458358Z | raise the_error_hypothesis_found 2025-05-07T20:32:05.7458789Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:05.7459228Z +-+---------------- 1 ---------------- 2025-05-07T20:32:05.7459545Z | Traceback (most recent call last): 2025-05-07T20:32:05.7460266Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:05.7461058Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.7461548Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.7463622Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.7465802Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:05.7466261Z | self=, 2025-05-07T20:32:05.7466688Z | T=2048, 2025-05-07T20:32:05.7467000Z | D=5120, # or any other generated value 2025-05-07T20:32:05.7467362Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:05.7467727Z | contiguous=True, # or any other generated value 2025-05-07T20:32:05.7468105Z | compiled=False, # or any other generated value 2025-05-07T20:32:05.7468405Z | ) 2025-05-07T20:32:05.7468591Z | 2025-05-07T20:32:05.7469137Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:05.7469807Z +---------------- 2 ---------------- 2025-05-07T20:32:05.7470098Z | Traceback (most recent call last): 2025-05-07T20:32:05.7470814Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:05.7471590Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.7471964Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.7474024Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.7476045Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:05.7476560Z | self=, 2025-05-07T20:32:05.7476990Z | T=128, 2025-05-07T20:32:05.7477202Z | D=7168, 2025-05-07T20:32:05.7477422Z | scale_ub=None, 2025-05-07T20:32:05.7477674Z | contiguous=True, 2025-05-07T20:32:05.7477923Z | compiled=True, 2025-05-07T20:32:05.7478163Z | ) 2025-05-07T20:32:05.7478349Z | 2025-05-07T20:32:05.7478954Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:05.7479598Z +---------------- 3 ---------------- 2025-05-07T20:32:05.7479892Z | Traceback (most recent call last): 2025-05-07T20:32:05.7480613Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:05.7481417Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.7481810Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.7483878Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
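Each falsifying example above comes with a reproduce_failure blob. A sketch of replaying failure 1 exactly (the decorator is temporary, and the blob is tied to Hypothesis 6.131.14 as the message says):

    from hypothesis import reproduce_failure

    # Added on top of the existing @given/@settings decorators, then removed
    # again after debugging:
    #
    #   @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
    #   def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled):
    #       ...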
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.7485875Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:05.7486324Z | self=, 2025-05-07T20:32:05.7486744Z | T=128, 2025-05-07T20:32:05.7486951Z | D=5120, 2025-05-07T20:32:05.7487170Z | scale_ub=1200.0, 2025-05-07T20:32:05.7487414Z | contiguous=True, 2025-05-07T20:32:05.7487776Z | compiled=True, 2025-05-07T20:32:05.7488006Z | ) 2025-05-07T20:32:05.7488188Z | 2025-05-07T20:32:05.7488732Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:05.7489348Z +---------------- 4 ---------------- 2025-05-07T20:32:05.7489689Z | Traceback (most recent call last): 2025-05-07T20:32:05.7490412Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:05.7491249Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:05.7491558Z | ^^^^^^^^ 2025-05-07T20:32:05.7492278Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:05.7492974Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.7493320Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.7494126Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:05.7494922Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.7495531Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:05.7496260Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.7496716Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.7497454Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:05.7498225Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.7498698Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.7499380Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:05.7500185Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.7500657Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.7501299Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:05.7502014Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.7502392Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.7502993Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:05.7503565Z | fn() 2025-05-07T20:32:05.7504258Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:05.7504891Z | self.fn.run( 2025-05-07T20:32:05.7505438Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:05.7506314Z | kernel = self.compile( 2025-05-07T20:32:05.7506585Z | ^^^^^^^^^^^^^ 2025-05-07T20:32:05.7507184Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:05.7507902Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.7508290Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.7508929Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.7509724Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.7510217Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.7510603Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.7510957Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:05.7511225Z | ^ 2025-05-07T20:32:05.7511688Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.7512350Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:05.7512750Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:05.7513270Z | self=, 2025-05-07T20:32:05.7513705Z | T=1, # or any other generated value 2025-05-07T20:32:05.7514024Z | D=5120, # or any other generated value 2025-05-07T20:32:05.7514367Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:05.7514743Z | contiguous=True, # or any other generated value 2025-05-07T20:32:05.7515103Z | compiled=True, # or any other generated value 2025-05-07T20:32:05.7515416Z | ) 2025-05-07T20:32:05.7515603Z | 2025-05-07T20:32:05.7516126Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:05.7516739Z +------------------------------------ 2025-05-07T20:32:05.7517186Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:05.7517572Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.7517982Z self=, 2025-05-07T20:32:05.7518391Z T=1, 2025-05-07T20:32:05.7518586Z D=5120, 2025-05-07T20:32:05.7518786Z scale_ub=None, 2025-05-07T20:32:05.7519014Z contiguous=True, 2025-05-07T20:32:05.7519246Z compiled=True, 2025-05-07T20:32:05.7519454Z ) 2025-05-07T20:32:05.7519789Z self = 2025-05-07T20:32:05.7520275Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.7520536Z 2025-05-07T20:32:05.7520627Z @given( 2025-05-07T20:32:05.7520864Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.7521189Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.7521500Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.7521840Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.7522174Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.7522472Z ) 2025-05-07T20:32:05.7522819Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.7523272Z def test_silu_mul_quant( 2025-05-07T20:32:05.7523522Z self, 2025-05-07T20:32:05.7523716Z T: int, 2025-05-07T20:32:05.7523925Z D: int, 2025-05-07T20:32:05.7524278Z scale_ub: Optional[float], 2025-05-07T20:32:05.7524567Z contiguous: 
bool, 2025-05-07T20:32:05.7524812Z compiled: bool, 2025-05-07T20:32:05.7525050Z ) -> None: 2025-05-07T20:32:05.7525277Z torch.manual_seed(2025) 2025-05-07T20:32:05.7525521Z 2025-05-07T20:32:05.7525799Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.7526157Z 2025-05-07T20:32:05.7526358Z x_sign = torch.sign(x) 2025-05-07T20:32:05.7526667Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.7526987Z x = x_sign * x_clamp 2025-05-07T20:32:05.7527233Z x0 = x[:, :D] 2025-05-07T20:32:05.7527463Z x1 = x[:, D:] 2025-05-07T20:32:05.7527776Z 2025-05-07T20:32:05.7527966Z if contiguous: 2025-05-07T20:32:05.7528208Z x0 = x0.contiguous() 2025-05-07T20:32:05.7528478Z x1 = x1.contiguous() 2025-05-07T20:32:05.7528722Z 2025-05-07T20:32:05.7528920Z if scale_ub is not None: 2025-05-07T20:32:05.7529213Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.7529549Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.7529868Z ) 2025-05-07T20:32:05.7530068Z else: 2025-05-07T20:32:05.7530290Z scale_ub_tensor = None 2025-05-07T20:32:05.7530539Z 2025-05-07T20:32:05.7530779Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.7531097Z op = silu_mul_quant 2025-05-07T20:32:05.7531402Z if compiled: 2025-05-07T20:32:05.7531658Z op = torch.compile(op) 2025-05-07T20:32:05.7531964Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.7532246Z 2025-05-07T20:32:05.7532447Z y_fp8, y_scale = fn() 2025-05-07T20:32:05.7532747Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:05.7533040Z 2025-05-07T20:32:05.7533289Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.7533636Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:05.7533934Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:05.7534256Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:05.7534631Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.7534946Z 2025-05-07T20:32:05.7535156Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:05.7535360Z 2025-05-07T20:32:05.7535462Z moe/activation_test.py:126: 2025-05-07T20:32:05.7535829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7536170Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:05.7536509Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.7537304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:05.7538058Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.7538608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.7539299Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.7539988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:05.7540711Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.7541466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:05.7542221Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.7542947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:05.7543588Z return 
self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.7544276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:05.7544810Z fn() 2025-05-07T20:32:05.7545328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:05.7545904Z self.fn.run( 2025-05-07T20:32:05.7546385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.7546925Z kernel = self.compile( 2025-05-07T20:32:05.7547469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.7548132Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.7548530Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7548762Z 2025-05-07T20:32:05.7548978Z self = 2025-05-07T20:32:05.7550061Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.7551445Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a97697ce0>} 2025-05-07T20:32:05.7552842Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.7553873Z context = 2025-05-07T20:32:05.7554161Z 2025-05-07T20:32:05.7554335Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.7554860Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.7555334Z module_map=module_map) 2025-05-07T20:32:05.7555704Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.7556060Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:05.7556332Z E ^ 2025-05-07T20:32:05.7556926Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.7557613Z 2025-05-07T20:32:05.7558052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.7558568Z 2025-05-07T20:32:05.7558675Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.7559098Z self=, 2025-05-07T20:32:05.7559506Z T=2048, 2025-05-07T20:32:05.7559701Z D=5120, 2025-05-07T20:32:05.7559904Z scale_ub=1200.0, 2025-05-07T20:32:05.7560142Z contiguous=True, 2025-05-07T20:32:05.7560369Z compiled=False, 2025-05-07T20:32:05.7560591Z ) 2025-05-07T20:32:05.7560913Z self = 2025-05-07T20:32:05.7561421Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.7561695Z 2025-05-07T20:32:05.7561776Z @given( 2025-05-07T20:32:05.7562019Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.7562344Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.7562659Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.7562995Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.7563332Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.7563614Z ) 2025-05-07T20:32:05.7563964Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.7564407Z def test_silu_mul_quant( 2025-05-07T20:32:05.7564654Z self, 2025-05-07T20:32:05.7564946Z T: int, 2025-05-07T20:32:05.7565150Z D: int, 2025-05-07T20:32:05.7565375Z scale_ub: Optional[float], 2025-05-07T20:32:05.7565645Z contiguous: bool, 2025-05-07T20:32:05.7565893Z compiled: bool, 2025-05-07T20:32:05.7566126Z ) -> None: 2025-05-07T20:32:05.7566340Z torch.manual_seed(2025) 2025-05-07T20:32:05.7566618Z 2025-05-07T20:32:05.7566923Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.7567277Z 2025-05-07T20:32:05.7567487Z x_sign = torch.sign(x) 2025-05-07T20:32:05.7567909Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.7568223Z x = x_sign * x_clamp 2025-05-07T20:32:05.7568471Z x0 = x[:, :D] 2025-05-07T20:32:05.7568696Z x1 = x[:, D:] 2025-05-07T20:32:05.7568909Z 2025-05-07T20:32:05.7569100Z if contiguous: 2025-05-07T20:32:05.7569341Z x0 = x0.contiguous() 2025-05-07T20:32:05.7569608Z x1 = x1.contiguous() 2025-05-07T20:32:05.7569858Z 2025-05-07T20:32:05.7570062Z if scale_ub is not None: 2025-05-07T20:32:05.7570337Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.7570680Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.7571003Z ) 2025-05-07T20:32:05.7571210Z else: 2025-05-07T20:32:05.7571422Z scale_ub_tensor = None 2025-05-07T20:32:05.7571685Z 2025-05-07T20:32:05.7571983Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.7572302Z op = silu_mul_quant 2025-05-07T20:32:05.7572558Z if compiled: 2025-05-07T20:32:05.7572812Z op = torch.compile(op) 2025-05-07T20:32:05.7573110Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.7573396Z 2025-05-07T20:32:05.7573595Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.7573763Z 2025-05-07T20:32:05.7573865Z moe/activation_test.py:117: 2025-05-07T20:32:05.7574178Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7595238Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.7595684Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.7596676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.7597702Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.7598449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.7599572Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.7600506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.7601268Z kernel = self.compile( 2025-05-07T20:32:05.7602035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.7602936Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.7603474Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7603797Z 2025-05-07T20:32:05.7604078Z self = 2025-05-07T20:32:05.7605591Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.7607897Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a97911da0>} 2025-05-07T20:32:05.7609984Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.7611336Z context = 2025-05-07T20:32:05.7611720Z 2025-05-07T20:32:05.7611934Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.7612674Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.7613335Z module_map=module_map) 2025-05-07T20:32:05.7613859Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.7614355Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.7614711Z E ^ 2025-05-07T20:32:05.7615363Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.7616006Z 2025-05-07T20:32:05.7616590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.7617364Z 2025-05-07T20:32:05.7617519Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.7618084Z self=, 2025-05-07T20:32:05.7618647Z T=2048, 2025-05-07T20:32:05.7618906Z D=5120, 2025-05-07T20:32:05.7619169Z scale_ub=1200.0, 2025-05-07T20:32:05.7619486Z contiguous=True, 2025-05-07T20:32:05.7619798Z compiled=True, 2025-05-07T20:32:05.7620091Z ) 2025-05-07T20:32:05.7620638Z self = 2025-05-07T20:32:05.7621330Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.7621703Z 2025-05-07T20:32:05.7621835Z @given( 2025-05-07T20:32:05.7622143Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.7622555Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.7622953Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.7623391Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.7623831Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.7624207Z ) 2025-05-07T20:32:05.7624656Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.7625225Z def test_silu_mul_quant( 2025-05-07T20:32:05.7625539Z self, 2025-05-07T20:32:05.7625793Z T: int, 2025-05-07T20:32:05.7626050Z D: int, 2025-05-07T20:32:05.7626354Z scale_ub: Optional[float], 2025-05-07T20:32:05.7626920Z contiguous: bool, 2025-05-07T20:32:05.7627254Z compiled: bool, 2025-05-07T20:32:05.7627587Z ) -> None: 2025-05-07T20:32:05.7627897Z torch.manual_seed(2025) 2025-05-07T20:32:05.7628227Z 2025-05-07T20:32:05.7628611Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.7629099Z 2025-05-07T20:32:05.7629359Z x_sign = torch.sign(x) 2025-05-07T20:32:05.7629761Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.7630187Z x = x_sign * x_clamp 2025-05-07T20:32:05.7630521Z x0 = x[:, :D] 2025-05-07T20:32:05.7630838Z x1 = x[:, D:] 2025-05-07T20:32:05.7631133Z 2025-05-07T20:32:05.7631390Z if contiguous: 2025-05-07T20:32:05.7631721Z x0 = x0.contiguous() 2025-05-07T20:32:05.7632088Z x1 = x1.contiguous() 2025-05-07T20:32:05.7632436Z 2025-05-07T20:32:05.7632706Z if scale_ub is not None: 2025-05-07T20:32:05.7633108Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.7633583Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.7634012Z ) 2025-05-07T20:32:05.7634290Z else: 2025-05-07T20:32:05.7634591Z scale_ub_tensor = None 2025-05-07T20:32:05.7634925Z 2025-05-07T20:32:05.7635247Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.7635679Z op = silu_mul_quant 2025-05-07T20:32:05.7636020Z if compiled: 2025-05-07T20:32:05.7636474Z op = torch.compile(op) 2025-05-07T20:32:05.7636889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.7637274Z 2025-05-07T20:32:05.7637553Z y_fp8, y_scale = fn() 2025-05-07T20:32:05.7637949Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:05.7638346Z 2025-05-07T20:32:05.7638672Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.7639141Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:05.7639566Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:05.7639975Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:05.7640450Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.7640835Z 2025-05-07T20:32:05.7641079Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:05.7641331Z 2025-05-07T20:32:05.7641453Z moe/activation_test.py:126: 2025-05-07T20:32:05.7641827Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7642236Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:05.7642660Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.7643774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:05.7644798Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.7645651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.7646629Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.7647708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:05.7648738Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.7651011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:05.7652069Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.7653091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:05.7653991Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.7654915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:05.7655658Z fn() 2025-05-07T20:32:05.7656390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:05.7657240Z self.fn.run( 2025-05-07T20:32:05.7657902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.7658646Z kernel = self.compile( 2025-05-07T20:32:05.7659411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.7660328Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.7660883Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7661207Z 2025-05-07T20:32:05.7661490Z self = 2025-05-07T20:32:05.7663011Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.7664994Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f3a9642f2e0>} 2025-05-07T20:32:05.7667007Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.7668479Z context = 2025-05-07T20:32:05.7668906Z 2025-05-07T20:32:05.7669144Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.7669886Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.7670538Z module_map=module_map) 2025-05-07T20:32:05.7671021Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.7671512Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:05.7671893Z E ^ 2025-05-07T20:32:05.7672541Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.7673187Z 2025-05-07T20:32:05.7673780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.7674513Z 2025-05-07T20:32:05.7674661Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.7675243Z self=, 2025-05-07T20:32:05.7675805Z T=16384, 2025-05-07T20:32:05.7676090Z D=7168, 2025-05-07T20:32:05.7676370Z scale_ub=1200.0, 2025-05-07T20:32:05.7676687Z contiguous=False, 2025-05-07T20:32:05.7677073Z compiled=False, 2025-05-07T20:32:05.7677367Z ) 2025-05-07T20:32:05.7677808Z self = 2025-05-07T20:32:05.7678511Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:05.7678914Z 2025-05-07T20:32:05.7679027Z @given( 2025-05-07T20:32:05.7679356Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.7679798Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.7680247Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.7680725Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.7681182Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.7681589Z ) 2025-05-07T20:32:05.7682089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.7682705Z def test_silu_mul_quant( 2025-05-07T20:32:05.7683052Z self, 2025-05-07T20:32:05.7683394Z T: int, 2025-05-07T20:32:05.7683674Z D: int, 2025-05-07T20:32:05.7683988Z scale_ub: Optional[float], 2025-05-07T20:32:05.7684375Z contiguous: bool, 2025-05-07T20:32:05.7684723Z compiled: bool, 2025-05-07T20:32:05.7685029Z ) -> None: 2025-05-07T20:32:05.7685332Z torch.manual_seed(2025) 2025-05-07T20:32:05.7685681Z 2025-05-07T20:32:05.7686048Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.7686527Z 2025-05-07T20:32:05.7686840Z x_sign = torch.sign(x) 2025-05-07T20:32:05.7687251Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.7687769Z x = x_sign * x_clamp 2025-05-07T20:32:05.7688092Z x0 = x[:, :D] 2025-05-07T20:32:05.7688396Z x1 = x[:, D:] 2025-05-07T20:32:05.7688701Z 2025-05-07T20:32:05.7688969Z if contiguous: 2025-05-07T20:32:05.7689291Z x0 = x0.contiguous() 2025-05-07T20:32:05.7689661Z x1 = x1.contiguous() 2025-05-07T20:32:05.7690014Z 2025-05-07T20:32:05.7690279Z if scale_ub is not None: 2025-05-07T20:32:05.7690662Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.7691137Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.7691591Z ) 2025-05-07T20:32:05.7691869Z else: 2025-05-07T20:32:05.7692172Z scale_ub_tensor = None 2025-05-07T20:32:05.7692537Z 2025-05-07T20:32:05.7693007Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.7693462Z op = silu_mul_quant 2025-05-07T20:32:05.7693826Z if compiled: 
        op = torch.compile(op)
    return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': ..., 'min_dot_size': ...}
module_map = {'triton.language.extra.libdevice': ...}
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
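Every Hypothesis example fails in the same way: Triton rejects fp8e4nv (its name for the e4m3 fp8 format, torch.float8_e4m3fn) while lowering both _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row. Triton only lowers fp8e4nv on NVIDIA GPUs with compute capability 8.9 or newer, and this job runs on a linux.g5.4xlarge runner whose A10G GPU is sm_86, so the kernels can never compile regardless of the example parameters. A minimal sketch of a capability guard that would skip these tests on unsupported hardware follows; the helper name and class placement are assumptions for illustration, not FBGEMM's actual test setup.

# Hypothetical guard, not part of moe/activation_test.py: skip fp8 tests on GPUs
# that cannot compile fp8e4nv kernels.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) is only lowered by Triton on NVIDIA GPUs
    # with compute capability >= 8.9; the A10G on this runner is sm_86.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
class ActivationFp8Tests(unittest.TestCase):
    ...  # test_silu_mul_quant would live here

With such a guard the remaining examples below would be reported as skips rather than repeating the identical CompilationError for each parameter combination.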
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
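For reference, the quantization that ref_fn delegates to triton_quantize_fp8_row is row-wise fp8 scaling: one dequantization scale per row, chosen so that the row fits the e4m3 range, with scale_ub (exercised by the examples below) capping the per-row max. A minimal eager-mode sketch of those semantics, assuming float32 input; the function name, eps, and the exact scale_ub handling are assumptions for illustration, not fbgemm_gpu's implementation:

# Hypothetical eager-mode sketch of row-wise fp8 quantization. It mirrors the
# contract the test checks (y ~= y_fp8.to(torch.float32) * y_scale[:, None]),
# not FBGEMM's Triton kernel.
from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3


def quantize_fp8_row_ref(
    y: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
    eps: float = 1e-12,
) -> Tuple[torch.Tensor, torch.Tensor]:
    row_max = y.abs().amax(dim=-1)
    if scale_ub is not None:
        # Assumed semantics: cap outlier rows at the provided upper bound.
        row_max = torch.clamp(row_max, max=scale_ub.item())
    scale = torch.clamp(row_max, min=eps) / FP8_MAX  # per-row dequant scale
    y_fp8 = (y / scale[:, None]).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_fp8, scale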
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:05.7905296Z ) 2025-05-07T20:32:05.7905376Z else: 2025-05-07T20:32:05.7905472Z scale_ub_tensor = None 2025-05-07T20:32:05.7905554Z 2025-05-07T20:32:05.7905908Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.7906043Z op = silu_mul_quant 2025-05-07T20:32:05.7906146Z if compiled: 2025-05-07T20:32:05.7906251Z op = torch.compile(op) 2025-05-07T20:32:05.7906363Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.7906438Z 2025-05-07T20:32:05.7906530Z y_fp8, y_scale = fn() 2025-05-07T20:32:05.7906659Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:05.7906730Z 2025-05-07T20:32:05.7906865Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.7906973Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:05.7907078Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:05.7907199Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:05.7907344Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.7907419Z 2025-05-07T20:32:05.7907525Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:05.7907530Z 2025-05-07T20:32:05.7907627Z moe/activation_test.py:126: 2025-05-07T20:32:05.7907756Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7907955Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:05.7908091Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.7908649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:05.7908760Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.7909124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.7909354Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.7909721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:05.7922183Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.7922627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:05.7923016Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.7923398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:05.7923565Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.7923914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:05.7923992Z fn() 2025-05-07T20:32:05.7924390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:05.7924480Z self.fn.run( 2025-05-07T20:32:05.7924816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.7924911Z kernel = self.compile( 2025-05-07T20:32:05.7925293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.7925466Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.7925604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
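For context: Triton's fp8e4nv is the FP8 E4M3 ("e4m3fn") format, which Triton lowers only on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper); on older architectures it offers only 'fp8e4b15' and 'fp8e5', which is exactly what the ValueError reports. A minimal sketch of a capability gate a caller or test could use to avoid compiling these kernels on unsupported devices (the helper name is illustrative, not FBGEMM or Triton API):

    import torch

    def supports_fp8_e4m3() -> bool:
        # Illustrative check: Triton lowers fp8e4nv (FP8 E4M3) only on
        # NVIDIA GPUs with compute capability >= (8, 9), i.e. Ada/Hopper.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

Such a predicate could feed pytest.mark.skipif (or unittest.skipIf) so the test never draws Hypothesis examples on hardware that cannot compile the kernel.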
2025-05-07T20:32:05.7930105Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)

T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    (test source identical to the listing above)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
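A note on the "Trying example:" lines: they come from Hypothesis, because the test runs under @settings(verbosity=Verbosity.verbose); each line is one draw from the st.sampled_from grids declared in @given, so the section below is simply every drawn combination hitting the same compile failure. A self-contained sketch of the same pattern, independent of this test suite:

    import hypothesis.strategies as st
    from hypothesis import Verbosity, given, settings

    @given(
        T=st.sampled_from([1, 128, 2048]),
        contiguous=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=6, deadline=None)
    def test_demo(T: int, contiguous: bool) -> None:
        # Under Verbosity.verbose, Hypothesis prints "Trying example: ..."
        # for every draw -- the same lines that fill this log.
        assert T >= 1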
y_scale_ref = ref_fn() 2025-05-07T20:32:05.7949793Z 2025-05-07T20:32:05.7949969Z moe/activation_test.py:126: 2025-05-07T20:32:05.7950101Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7950203Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:05.7950336Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.7950895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:05.7950998Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.7951358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.7951578Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.7951942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:05.7952201Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.7952594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:05.7952844Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.7953214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:05.7953420Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.7953758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:05.7953834Z fn() 2025-05-07T20:32:05.7954227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:05.7954312Z self.fn.run( 2025-05-07T20:32:05.7954648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.7954739Z kernel = self.compile( 2025-05-07T20:32:05.7955125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.7955301Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.7955433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7955487Z 2025-05-07T20:32:05.7955689Z self = 2025-05-07T20:32:05.7956460Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.7957015Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f3a8455d4e0>} 2025-05-07T20:32:05.7957771Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.7957960Z context = 2025-05-07T20:32:05.7957965Z 2025-05-07T20:32:05.7958126Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.7958398Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.7958502Z module_map=module_map) 2025-05-07T20:32:05.7958664Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.7958762Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:05.7958837Z E ^ 2025-05-07T20:32:05.7959292Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.7959297Z 2025-05-07T20:32:05.7959708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.7959713Z 2025-05-07T20:32:05.7959816Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.7960032Z self=, 2025-05-07T20:32:05.7960108Z T=1, 2025-05-07T20:32:05.7960189Z D=5120, 2025-05-07T20:32:05.7960269Z scale_ub=None, 2025-05-07T20:32:05.7960353Z contiguous=True, 2025-05-07T20:32:05.7960438Z compiled=False, 2025-05-07T20:32:05.7960507Z ) 2025-05-07T20:32:05.7960721Z self = 2025-05-07T20:32:05.7960888Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:05.7960893Z 2025-05-07T20:32:05.7960968Z @given( 2025-05-07T20:32:05.7961095Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.7961193Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.7961306Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.7961426Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.7961534Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.7961609Z ) 2025-05-07T20:32:05.7961853Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.7961989Z def test_silu_mul_quant( 2025-05-07T20:32:05.7962064Z self, 2025-05-07T20:32:05.7962145Z T: int, 2025-05-07T20:32:05.7962221Z D: int, 2025-05-07T20:32:05.7962319Z scale_ub: Optional[float], 2025-05-07T20:32:05.7962405Z contiguous: bool, 2025-05-07T20:32:05.7962488Z compiled: bool, 2025-05-07T20:32:05.7962568Z ) -> None: 2025-05-07T20:32:05.7962659Z torch.manual_seed(2025) 2025-05-07T20:32:05.7962729Z 2025-05-07T20:32:05.7962906Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.7962982Z 2025-05-07T20:32:05.7963073Z x_sign = torch.sign(x) 2025-05-07T20:32:05.7963204Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.7963291Z x = x_sign * x_clamp 2025-05-07T20:32:05.7963372Z x0 = x[:, :D] 2025-05-07T20:32:05.7963457Z x1 = x[:, D:] 2025-05-07T20:32:05.7963529Z 2025-05-07T20:32:05.7963612Z if contiguous: 2025-05-07T20:32:05.7963747Z x0 = x0.contiguous() 2025-05-07T20:32:05.7963832Z x1 = x1.contiguous() 2025-05-07T20:32:05.7963909Z 2025-05-07T20:32:05.7963997Z if scale_ub is not None: 2025-05-07T20:32:05.7964103Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.7964241Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.7964311Z ) 2025-05-07T20:32:05.7964386Z else: 2025-05-07T20:32:05.7964484Z scale_ub_tensor = None 2025-05-07T20:32:05.7964561Z 2025-05-07T20:32:05.7964688Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.7964781Z op = silu_mul_quant 2025-05-07T20:32:05.7964867Z if compiled: 2025-05-07T20:32:05.7964963Z 
op = torch.compile(op) 2025-05-07T20:32:05.7965070Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.7965140Z 2025-05-07T20:32:05.7965232Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.7965242Z 2025-05-07T20:32:05.7965335Z moe/activation_test.py:117: 2025-05-07T20:32:05.7965461Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7965566Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.7965662Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.7966158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.7966258Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.7966691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.7966936Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.7967302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.7967395Z kernel = self.compile( 2025-05-07T20:32:05.7967829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.7968004Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.7968130Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7968138Z 2025-05-07T20:32:05.7968339Z self = 2025-05-07T20:32:05.7969121Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.7969627Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a8455cea0>} 2025-05-07T20:32:05.7970367Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.7970606Z context = 2025-05-07T20:32:05.7970610Z 2025-05-07T20:32:05.7970771Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.7971031Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.7971144Z module_map=module_map) 2025-05-07T20:32:05.7971303Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.7971406Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.7971481Z E ^ 2025-05-07T20:32:05.7971833Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.7971838Z 2025-05-07T20:32:05.7972249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.7972298Z 2025-05-07T20:32:05.7972400Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.7972617Z self=, 2025-05-07T20:32:05.7972702Z T=128, 2025-05-07T20:32:05.7972778Z D=5120, 2025-05-07T20:32:05.7972863Z scale_ub=None, 2025-05-07T20:32:05.7972946Z contiguous=False, 2025-05-07T20:32:05.7973024Z compiled=True, 2025-05-07T20:32:05.7973100Z ) 2025-05-07T20:32:05.7973318Z self = 2025-05-07T20:32:05.7973488Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:05.7973493Z 2025-05-07T20:32:05.7973574Z @given( 2025-05-07T20:32:05.7973692Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.7973790Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.7973912Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.7974031Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.7974144Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.7974213Z ) 2025-05-07T20:32:05.7974452Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.7974546Z def test_silu_mul_quant( 2025-05-07T20:32:05.7974620Z self, 2025-05-07T20:32:05.7974695Z T: int, 2025-05-07T20:32:05.7974778Z D: int, 2025-05-07T20:32:05.7974956Z scale_ub: Optional[float], 2025-05-07T20:32:05.7975045Z contiguous: bool, 2025-05-07T20:32:05.7975133Z compiled: bool, 2025-05-07T20:32:05.7975211Z ) -> None: 2025-05-07T20:32:05.7975302Z torch.manual_seed(2025) 2025-05-07T20:32:05.7975377Z 2025-05-07T20:32:05.7975543Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.7975622Z 2025-05-07T20:32:05.7975712Z x_sign = torch.sign(x) 2025-05-07T20:32:05.7975838Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.7975928Z x = x_sign * x_clamp 2025-05-07T20:32:05.7976006Z x0 = x[:, :D] 2025-05-07T20:32:05.7976084Z x1 = x[:, D:] 2025-05-07T20:32:05.7976165Z 2025-05-07T20:32:05.7976246Z if contiguous: 2025-05-07T20:32:05.7976335Z x0 = x0.contiguous() 2025-05-07T20:32:05.7976428Z x1 = x1.contiguous() 2025-05-07T20:32:05.7976498Z 2025-05-07T20:32:05.7976593Z if scale_ub is not None: 2025-05-07T20:32:05.7976703Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.7976860Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.7976944Z ) 2025-05-07T20:32:05.7977036Z else: 2025-05-07T20:32:05.7977128Z scale_ub_tensor = None 2025-05-07T20:32:05.7977198Z 2025-05-07T20:32:05.7977323Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.7977529Z op = silu_mul_quant 2025-05-07T20:32:05.7977619Z if compiled: 2025-05-07T20:32:05.7977716Z op = torch.compile(op) 2025-05-07T20:32:05.7977821Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.7977894Z 2025-05-07T20:32:05.7977984Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.7977989Z 2025-05-07T20:32:05.7978085Z moe/activation_test.py:117: 2025-05-07T20:32:05.7978215Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7978322Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.7978424Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.7978786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.7978876Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.7979368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.7979512Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.7979863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.7980085Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.7980420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.7980516Z kernel = self.compile( 2025-05-07T20:32:05.7980897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.7981069Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.7981202Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7981207Z 2025-05-07T20:32:05.7981409Z self = 2025-05-07T20:32:05.7982188Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.7982694Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9c09c60>} 2025-05-07T20:32:05.7983510Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.7983706Z context = 2025-05-07T20:32:05.7983711Z 2025-05-07T20:32:05.7983874Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.7984139Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.7984250Z module_map=module_map) 2025-05-07T20:32:05.7984408Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.7984508Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.7984586Z E ^ 2025-05-07T20:32:05.7984939Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.7984944Z 2025-05-07T20:32:05.7985360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.7985365Z 2025-05-07T20:32:05.7985466Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.7985689Z self=, 2025-05-07T20:32:05.7985768Z T=128, 2025-05-07T20:32:05.7985847Z D=7168, 2025-05-07T20:32:05.7985930Z scale_ub=1200.0, 2025-05-07T20:32:05.7986011Z contiguous=False, 2025-05-07T20:32:05.7986138Z compiled=False, 2025-05-07T20:32:05.7986209Z ) 2025-05-07T20:32:05.7986420Z self = 2025-05-07T20:32:05.7986593Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:05.7986598Z 2025-05-07T20:32:05.7986677Z @given( 2025-05-07T20:32:05.7986806Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.7986904Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.7987023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.7987143Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.7987254Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.7987329Z ) 2025-05-07T20:32:05.7987577Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.7987668Z def test_silu_mul_quant( 2025-05-07T20:32:05.7987745Z self, 2025-05-07T20:32:05.7987831Z T: int, 2025-05-07T20:32:05.7987947Z D: int, 2025-05-07T20:32:05.7988052Z scale_ub: Optional[float], 2025-05-07T20:32:05.7988142Z contiguous: bool, 2025-05-07T20:32:05.7988227Z compiled: bool, 2025-05-07T20:32:05.7988313Z ) -> None: 2025-05-07T20:32:05.7988406Z torch.manual_seed(2025) 2025-05-07T20:32:05.7988477Z 2025-05-07T20:32:05.7988649Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.7988725Z 2025-05-07T20:32:05.7988819Z x_sign = torch.sign(x) 2025-05-07T20:32:05.7988947Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.7989035Z x = x_sign * x_clamp 2025-05-07T20:32:05.7989116Z x0 = x[:, :D] 2025-05-07T20:32:05.7989202Z x1 = x[:, D:] 2025-05-07T20:32:05.7989275Z 2025-05-07T20:32:05.7989359Z if contiguous: 2025-05-07T20:32:05.7989455Z x0 = x0.contiguous() 2025-05-07T20:32:05.7989544Z x1 = x1.contiguous() 2025-05-07T20:32:05.7989629Z 2025-05-07T20:32:05.7989719Z if scale_ub is not None: 2025-05-07T20:32:05.7989825Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.7989963Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.7990040Z ) 2025-05-07T20:32:05.7990116Z else: 2025-05-07T20:32:05.7990216Z scale_ub_tensor = None 2025-05-07T20:32:05.7990289Z 2025-05-07T20:32:05.7990417Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.7990617Z op = silu_mul_quant 2025-05-07T20:32:05.7990703Z if compiled: 2025-05-07T20:32:05.7990803Z op = torch.compile(op) 2025-05-07T20:32:05.7990914Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.7990987Z 2025-05-07T20:32:05.7991082Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.7991086Z 2025-05-07T20:32:05.7991182Z moe/activation_test.py:117: 2025-05-07T20:32:05.7991310Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7991420Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.7991518Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.7992012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.7992113Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.7992473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.7992698Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.7993035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.7993130Z kernel = self.compile( 2025-05-07T20:32:05.7993513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.7993732Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.7993858Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.7993869Z 2025-05-07T20:32:05.7994071Z self = 2025-05-07T20:32:05.7994847Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.7995357Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9c09800>} 2025-05-07T20:32:05.7996099Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.7996339Z context = 2025-05-07T20:32:05.7996344Z 2025-05-07T20:32:05.7996505Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.7996763Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.7996874Z module_map=module_map) 2025-05-07T20:32:05.7997035Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.7997140Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.7997215Z E ^ 2025-05-07T20:32:05.7997568Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.7997572Z 2025-05-07T20:32:05.7997989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.7997994Z 2025-05-07T20:32:05.7998101Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.7998320Z self=, 2025-05-07T20:32:05.7998405Z T=128, 2025-05-07T20:32:05.7998484Z D=5120, 2025-05-07T20:32:05.7998570Z scale_ub=None, 2025-05-07T20:32:05.7998654Z contiguous=False, 2025-05-07T20:32:05.7998737Z compiled=False, 2025-05-07T20:32:05.7998818Z ) 2025-05-07T20:32:05.7999035Z self = 2025-05-07T20:32:05.7999282Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:05.7999287Z 2025-05-07T20:32:05.7999368Z @given( 2025-05-07T20:32:05.7999487Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.7999583Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.7999703Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.7999820Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.7999945Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8000020Z ) 2025-05-07T20:32:05.8000261Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8000358Z def test_silu_mul_quant( 2025-05-07T20:32:05.8000436Z self, 2025-05-07T20:32:05.8000513Z T: int, 2025-05-07T20:32:05.8000596Z D: int, 2025-05-07T20:32:05.8000693Z scale_ub: Optional[float], 2025-05-07T20:32:05.8000781Z contiguous: bool, 2025-05-07T20:32:05.8000882Z compiled: bool, 2025-05-07T20:32:05.8000965Z ) -> None: 2025-05-07T20:32:05.8001061Z torch.manual_seed(2025) 2025-05-07T20:32:05.8001138Z 2025-05-07T20:32:05.8001302Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8001383Z 2025-05-07T20:32:05.8001475Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8001601Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8001738Z x = x_sign * x_clamp 2025-05-07T20:32:05.8001820Z x0 = x[:, :D] 2025-05-07T20:32:05.8001898Z x1 = x[:, D:] 2025-05-07T20:32:05.8001981Z 2025-05-07T20:32:05.8002063Z if contiguous: 2025-05-07T20:32:05.8002152Z x0 = x0.contiguous() 2025-05-07T20:32:05.8002252Z x1 = x1.contiguous() 2025-05-07T20:32:05.8002325Z 2025-05-07T20:32:05.8002416Z if scale_ub is not None: 2025-05-07T20:32:05.8002529Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8002666Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8002745Z ) 2025-05-07T20:32:05.8002826Z else: 2025-05-07T20:32:05.8002921Z scale_ub_tensor = None 2025-05-07T20:32:05.8003000Z 2025-05-07T20:32:05.8003128Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8003217Z op = silu_mul_quant 2025-05-07T20:32:05.8003306Z if compiled: 2025-05-07T20:32:05.8003406Z op = torch.compile(op) 2025-05-07T20:32:05.8003558Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8003636Z 2025-05-07T20:32:05.8003726Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8003730Z 2025-05-07T20:32:05.8003825Z moe/activation_test.py:117: 2025-05-07T20:32:05.8003959Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8004057Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8004162Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8004658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8004755Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8005115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8005332Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8005913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8006048Z kernel = self.compile( 2025-05-07T20:32:05.8006479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8006659Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8006784Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8006930Z 2025-05-07T20:32:05.8007153Z self = 2025-05-07T20:32:05.8008066Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8008571Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9c0bce0>} 2025-05-07T20:32:05.8009327Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8009515Z context = 2025-05-07T20:32:05.8009519Z 2025-05-07T20:32:05.8009694Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8009958Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8010066Z module_map=module_map) 2025-05-07T20:32:05.8010232Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8010333Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8010413Z E ^ 2025-05-07T20:32:05.8010768Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8010840Z 2025-05-07T20:32:05.8011251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8011256Z 2025-05-07T20:32:05.8011364Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8011583Z self=, 2025-05-07T20:32:05.8011662Z T=128, 2025-05-07T20:32:05.8011752Z D=5120, 2025-05-07T20:32:05.8011836Z scale_ub=1200.0, 2025-05-07T20:32:05.8011921Z contiguous=True, 2025-05-07T20:32:05.8012009Z compiled=False, 2025-05-07T20:32:05.8012082Z ) 2025-05-07T20:32:05.8012298Z self = 2025-05-07T20:32:05.8012475Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.8012480Z 2025-05-07T20:32:05.8012560Z @given( 2025-05-07T20:32:05.8012783Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8012883Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8012996Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8013119Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8013231Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8013307Z ) 2025-05-07T20:32:05.8013559Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8013654Z def test_silu_mul_quant( 2025-05-07T20:32:05.8013734Z self, 2025-05-07T20:32:05.8013811Z T: int, 2025-05-07T20:32:05.8013888Z D: int, 2025-05-07T20:32:05.8013993Z scale_ub: Optional[float], 2025-05-07T20:32:05.8014083Z contiguous: bool, 2025-05-07T20:32:05.8014172Z compiled: bool, 2025-05-07T20:32:05.8014256Z ) -> None: 2025-05-07T20:32:05.8014351Z torch.manual_seed(2025) 2025-05-07T20:32:05.8014431Z 2025-05-07T20:32:05.8014602Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8014677Z 2025-05-07T20:32:05.8014767Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8014896Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8014985Z x = x_sign * x_clamp 2025-05-07T20:32:05.8015075Z x0 = x[:, :D] 2025-05-07T20:32:05.8015156Z x1 = x[:, D:] 2025-05-07T20:32:05.8015231Z 2025-05-07T20:32:05.8015402Z if contiguous: 2025-05-07T20:32:05.8015496Z x0 = x0.contiguous() 2025-05-07T20:32:05.8015586Z x1 = x1.contiguous() 2025-05-07T20:32:05.8015667Z 2025-05-07T20:32:05.8015756Z if scale_ub is not None: 2025-05-07T20:32:05.8015861Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8016002Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8016076Z ) 2025-05-07T20:32:05.8016153Z else: 2025-05-07T20:32:05.8016257Z scale_ub_tensor = None 2025-05-07T20:32:05.8016331Z 2025-05-07T20:32:05.8016459Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8016557Z op = silu_mul_quant 2025-05-07T20:32:05.8016643Z if compiled: 2025-05-07T20:32:05.8016748Z op = torch.compile(op) 2025-05-07T20:32:05.8016853Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8016926Z 2025-05-07T20:32:05.8017026Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8017036Z 2025-05-07T20:32:05.8017133Z moe/activation_test.py:117: 2025-05-07T20:32:05.8017263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8017368Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8017467Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8017964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8018116Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8018472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8018699Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8019040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8019135Z kernel = self.compile( 2025-05-07T20:32:05.8019523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8019696Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8019830Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8019835Z 2025-05-07T20:32:05.8020039Z self = 2025-05-07T20:32:05.8020879Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8021384Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9fc3920>} 2025-05-07T20:32:05.8022133Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8022326Z context = 2025-05-07T20:32:05.8022331Z 2025-05-07T20:32:05.8022495Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8022756Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8022873Z module_map=module_map) 2025-05-07T20:32:05.8023036Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8023138Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8023216Z E ^ 2025-05-07T20:32:05.8023567Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8023571Z 2025-05-07T20:32:05.8024063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8024069Z 2025-05-07T20:32:05.8024173Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8024399Z self=, 2025-05-07T20:32:05.8024478Z T=1, 2025-05-07T20:32:05.8024557Z D=7168, 2025-05-07T20:32:05.8024646Z scale_ub=1200.0, 2025-05-07T20:32:05.8024730Z contiguous=True, 2025-05-07T20:32:05.8024819Z compiled=True, 2025-05-07T20:32:05.8024897Z ) 2025-05-07T20:32:05.8025112Z self = 2025-05-07T20:32:05.8025276Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.8025280Z 2025-05-07T20:32:05.8025365Z @given( 2025-05-07T20:32:05.8025485Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8025592Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8025712Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8025829Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8025948Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8026025Z ) 2025-05-07T20:32:05.8026268Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8026370Z def test_silu_mul_quant( 2025-05-07T20:32:05.8026443Z self, 2025-05-07T20:32:05.8026563Z T: int, 2025-05-07T20:32:05.8026648Z D: int, 2025-05-07T20:32:05.8026744Z scale_ub: Optional[float], 2025-05-07T20:32:05.8026833Z contiguous: bool, 2025-05-07T20:32:05.8026925Z compiled: bool, 2025-05-07T20:32:05.8027003Z ) -> None: 2025-05-07T20:32:05.8027103Z torch.manual_seed(2025) 2025-05-07T20:32:05.8027178Z 2025-05-07T20:32:05.8027346Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8027429Z 2025-05-07T20:32:05.8027526Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8027652Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8027748Z x = x_sign * x_clamp 2025-05-07T20:32:05.8027827Z x0 = x[:, :D] 2025-05-07T20:32:05.8027908Z x1 = x[:, D:] 2025-05-07T20:32:05.8027991Z 2025-05-07T20:32:05.8028074Z if contiguous: 2025-05-07T20:32:05.8028163Z x0 = x0.contiguous() 2025-05-07T20:32:05.8028260Z x1 = x1.contiguous() 2025-05-07T20:32:05.8028382Z 2025-05-07T20:32:05.8028472Z if scale_ub is not None: 2025-05-07T20:32:05.8028585Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8028720Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8028806Z ) 2025-05-07T20:32:05.8028879Z else: 2025-05-07T20:32:05.8028973Z scale_ub_tensor = None 2025-05-07T20:32:05.8029050Z 2025-05-07T20:32:05.8029179Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8029276Z op = silu_mul_quant 2025-05-07T20:32:05.8029369Z if compiled: 2025-05-07T20:32:05.8029470Z op = torch.compile(op) 2025-05-07T20:32:05.8029575Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8029652Z 2025-05-07T20:32:05.8029745Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8029749Z 2025-05-07T20:32:05.8029851Z moe/activation_test.py:117: 2025-05-07T20:32:05.8029979Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8030084Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8030189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8030554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.8030647Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.8031226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8031329Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8031692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8031913Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8032250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8032357Z kernel = self.compile( 2025-05-07T20:32:05.8032736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8032910Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8033043Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8033048Z 2025-05-07T20:32:05.8033252Z self = 2025-05-07T20:32:05.8034037Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8034537Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3a84950220>} 2025-05-07T20:32:05.8035332Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8035521Z context = 2025-05-07T20:32:05.8035525Z 2025-05-07T20:32:05.8035690Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8035963Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8036070Z module_map=module_map) 2025-05-07T20:32:05.8036238Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8036337Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8036415Z E ^ 2025-05-07T20:32:05.8036771Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:05.8037233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:05.8037338Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[identical test body and traceback omitted]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
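Note: every example in this section fails for the same reason. Triton only lowers the fp8e4nv type (float8 e4m3) on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper), and the linux.g5.4xlarge runner's A10G is SM 8.6, hence "The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')". A minimal sketch of a guard such a test could use; the helper name and skip message are illustrative, not part of the test file:

import unittest

import torch

def gpu_supports_fp8e4nv() -> bool:
    # Triton emits fp8e4nv (e4m3) only for SM 8.9+ (e.g. L4, L40S, H100).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Applied to the test class or method, this skips cleanly instead of erroring:
skip_unless_fp8e4nv = unittest.skipUnless(
    gpu_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+; A10G (g5) is SM 8.6"
)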
2025-05-07T20:32:05.8054674Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
[identical test body omitted] In this example fn() itself succeeds; the reference path fails instead:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
[triton autotuner and compiler frames omitted]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:05.8071152Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[identical test body and traceback omitted]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
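Note: the failure is independent of torch.compile (both compiled=True and compiled=False examples hit it) because the ValueError is raised while Triton lowers the kernel AST, before anything executes. A minimal self-contained sketch that should reproduce the same CompilationError on a pre-SM-8.9 GPU; the kernel name and shapes are illustrative:

import torch
import triton
import triton.language as tl

@triton.jit
def _cast_fp8e4nv_kernel(X, Y, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    x = tl.load(X + offs)
    # On SM < 8.9 this cast fails at compile time with:
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    tl.store(Y + offs, x.to(tl.float8e4nv))

x = torch.randn(16, device="cuda", dtype=torch.float32)
y = torch.empty(16, device="cuda", dtype=torch.float8_e4m3fn)
_cast_fp8e4nv_kernel[(1,)](x, y, BLOCK=16)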
2025-05-07T20:32:05.8085089Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
[identical test body and traceback omitted; same CompilationError in _fbgemm_silu_mul_quant]

2025-05-07T20:32:05.8097877Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[identical test body and traceback omitted; same CompilationError in _fbgemm_silu_mul_quant]

2025-05-07T20:32:05.8111472Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[identical test body and traceback omitted; same CompilationError in _fbgemm_silu_mul_quant]

2025-05-07T20:32:05.8124856Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
[identical test body and traceback omitted]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
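Note: the test's reference path fails as well, since triton_quantize_fp8_row also JIT-compiles a Triton kernel (_kernel_quantize_fp8_row). On hardware without fp8e4nv support, a rough eager-mode stand-in for rowwise fp8 quantization could look like the sketch below; it assumes the common max-abs / 448 e4m3 recipe and is not FBGEMM's actual implementation:

import torch

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value

def quantize_fp8_row_ref(y, scale_ub=None):
    # Rowwise scale so each row's max |value| maps to the e4m3 maximum;
    # dequantization is y_fp8.float() * scale[:, None], as in the test.
    row_max = y.abs().amax(dim=-1, keepdim=True).float()
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = torch.clamp(row_max, min=1e-12) / FP8_E4M3_MAX
    y_fp8 = (y.float() / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(-1)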
2025-05-07T20:32:05.8137860Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[identical test body and traceback omitted; same CompilationError in _fbgemm_silu_mul_quant]

2025-05-07T20:32:05.8150715Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
[identical test body and traceback omitted; same CompilationError in _fbgemm_silu_mul_quant]

2025-05-07T20:32:05.8164144Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
[identical test body and traceback omitted; same CompilationError in _fbgemm_silu_mul_quant]

2025-05-07T20:32:05.8181136Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
[identical test body and traceback omitted]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8193243Z 2025-05-07T20:32:05.8193653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8193658Z 2025-05-07T20:32:05.8193759Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8193984Z self=, 2025-05-07T20:32:05.8194066Z T=4096, 2025-05-07T20:32:05.8194141Z D=5120, 2025-05-07T20:32:05.8194229Z scale_ub=1200.0, 2025-05-07T20:32:05.8194318Z contiguous=False, 2025-05-07T20:32:05.8194403Z compiled=True, 2025-05-07T20:32:05.8194476Z ) 2025-05-07T20:32:05.8194693Z self = 2025-05-07T20:32:05.8194866Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:05.8194871Z 2025-05-07T20:32:05.8195024Z @given( 2025-05-07T20:32:05.8195144Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8195246Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8195359Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8195475Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8195590Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8195664Z ) 2025-05-07T20:32:05.8195906Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8196003Z def test_silu_mul_quant( 2025-05-07T20:32:05.8196078Z self, 2025-05-07T20:32:05.8196156Z T: int, 2025-05-07T20:32:05.8196231Z D: int, 2025-05-07T20:32:05.8196329Z scale_ub: Optional[float], 2025-05-07T20:32:05.8196419Z contiguous: bool, 2025-05-07T20:32:05.8196504Z compiled: bool, 2025-05-07T20:32:05.8196581Z ) -> None: 2025-05-07T20:32:05.8196678Z torch.manual_seed(2025) 2025-05-07T20:32:05.8196758Z 2025-05-07T20:32:05.8196952Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8197048Z 2025-05-07T20:32:05.8197141Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8197268Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8197355Z x = x_sign * x_clamp 2025-05-07T20:32:05.8197435Z x0 = x[:, :D] 2025-05-07T20:32:05.8197517Z x1 = x[:, D:] 2025-05-07T20:32:05.8197637Z 2025-05-07T20:32:05.8197720Z if contiguous: 2025-05-07T20:32:05.8197815Z x0 = x0.contiguous() 2025-05-07T20:32:05.8197903Z x1 = x1.contiguous() 2025-05-07T20:32:05.8197974Z 2025-05-07T20:32:05.8198066Z if scale_ub is not None: 2025-05-07T20:32:05.8198174Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8198307Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8198385Z ) 2025-05-07T20:32:05.8198460Z else: 2025-05-07T20:32:05.8198559Z scale_ub_tensor = None 2025-05-07T20:32:05.8198634Z 2025-05-07T20:32:05.8198762Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8198857Z op = silu_mul_quant 2025-05-07T20:32:05.8198941Z if compiled: 2025-05-07T20:32:05.8199040Z op = torch.compile(op) 2025-05-07T20:32:05.8199148Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8199223Z 2025-05-07T20:32:05.8199359Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8199363Z 2025-05-07T20:32:05.8199464Z moe/activation_test.py:117: 2025-05-07T20:32:05.8199593Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8199692Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8199792Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8200157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.8200256Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.8200744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8200840Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8201198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8201416Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8201758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8201851Z kernel = self.compile( 2025-05-07T20:32:05.8202230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8202410Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8202611Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8202615Z 2025-05-07T20:32:05.8202817Z self = 2025-05-07T20:32:05.8203590Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8204093Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a92d9620>} 2025-05-07T20:32:05.8204838Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8205025Z context = 2025-05-07T20:32:05.8205034Z 2025-05-07T20:32:05.8205200Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8205459Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8205564Z module_map=module_map) 2025-05-07T20:32:05.8206021Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8206159Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8206372Z E ^ 2025-05-07T20:32:05.8206759Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8206765Z 2025-05-07T20:32:05.8207175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8207179Z 2025-05-07T20:32:05.8207280Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8207504Z self=, 2025-05-07T20:32:05.8207642Z T=2048, 2025-05-07T20:32:05.8207720Z D=7168, 2025-05-07T20:32:05.8207801Z scale_ub=1200.0, 2025-05-07T20:32:05.8207889Z contiguous=False, 2025-05-07T20:32:05.8207972Z compiled=False, 2025-05-07T20:32:05.8208043Z ) 2025-05-07T20:32:05.8208259Z self = 2025-05-07T20:32:05.8208429Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:05.8208510Z 2025-05-07T20:32:05.8208587Z @given( 2025-05-07T20:32:05.8208703Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8208797Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8208911Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8209025Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8209134Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8209207Z ) 2025-05-07T20:32:05.8209450Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8209544Z def test_silu_mul_quant( 2025-05-07T20:32:05.8209619Z self, 2025-05-07T20:32:05.8209692Z T: int, 2025-05-07T20:32:05.8209769Z D: int, 2025-05-07T20:32:05.8209863Z scale_ub: Optional[float], 2025-05-07T20:32:05.8209949Z contiguous: bool, 2025-05-07T20:32:05.8210034Z compiled: bool, 2025-05-07T20:32:05.8210108Z ) -> None: 2025-05-07T20:32:05.8210208Z torch.manual_seed(2025) 2025-05-07T20:32:05.8210285Z 2025-05-07T20:32:05.8210449Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8210522Z 2025-05-07T20:32:05.8210614Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8210735Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8210820Z x = x_sign * x_clamp 2025-05-07T20:32:05.8210899Z x0 = x[:, :D] 2025-05-07T20:32:05.8210977Z x1 = x[:, D:] 2025-05-07T20:32:05.8211204Z 2025-05-07T20:32:05.8211285Z if contiguous: 2025-05-07T20:32:05.8211372Z x0 = x0.contiguous() 2025-05-07T20:32:05.8211459Z x1 = x1.contiguous() 2025-05-07T20:32:05.8211529Z 2025-05-07T20:32:05.8211617Z if scale_ub is not None: 2025-05-07T20:32:05.8211722Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8211852Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8211932Z ) 2025-05-07T20:32:05.8212006Z else: 2025-05-07T20:32:05.8212097Z scale_ub_tensor = None 2025-05-07T20:32:05.8212168Z 2025-05-07T20:32:05.8212297Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8212383Z op = silu_mul_quant 2025-05-07T20:32:05.8212467Z if compiled: 2025-05-07T20:32:05.8212564Z op = torch.compile(op) 2025-05-07T20:32:05.8212664Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8212737Z 2025-05-07T20:32:05.8212832Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8212836Z 2025-05-07T20:32:05.8212930Z moe/activation_test.py:117: 2025-05-07T20:32:05.8213060Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8213157Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8213253Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8213750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:05.8213893Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8214249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8214466Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8214801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8214901Z kernel = self.compile( 2025-05-07T20:32:05.8215276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8215445Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8215573Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8215577Z 2025-05-07T20:32:05.8215776Z self = 2025-05-07T20:32:05.8216588Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8217082Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a92da480>} 2025-05-07T20:32:05.8217829Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8218015Z context = 2025-05-07T20:32:05.8218020Z 2025-05-07T20:32:05.8218179Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8218448Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8218553Z module_map=module_map) 2025-05-07T20:32:05.8218716Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8218813Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8218888Z E ^ 2025-05-07T20:32:05.8219242Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8219325Z 2025-05-07T20:32:05.8219732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8219737Z 2025-05-07T20:32:05.8219841Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8220058Z self=, 2025-05-07T20:32:05.8220136Z T=1, 2025-05-07T20:32:05.8220217Z D=7168, 2025-05-07T20:32:05.8220299Z scale_ub=None, 2025-05-07T20:32:05.8220383Z contiguous=True, 2025-05-07T20:32:05.8220469Z compiled=False, 2025-05-07T20:32:05.8220539Z ) 2025-05-07T20:32:05.8220753Z self = 2025-05-07T20:32:05.8220919Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:05.8220924Z 2025-05-07T20:32:05.8221000Z @given( 2025-05-07T20:32:05.8221115Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8221223Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8221337Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8221454Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8221566Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8221640Z ) 2025-05-07T20:32:05.8221881Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8221973Z def test_silu_mul_quant( 2025-05-07T20:32:05.8222089Z self, 2025-05-07T20:32:05.8222166Z T: int, 2025-05-07T20:32:05.8222239Z D: int, 2025-05-07T20:32:05.8222331Z scale_ub: Optional[float], 2025-05-07T20:32:05.8222420Z contiguous: bool, 2025-05-07T20:32:05.8222502Z compiled: bool, 2025-05-07T20:32:05.8222580Z ) -> None: 2025-05-07T20:32:05.8222673Z torch.manual_seed(2025) 2025-05-07T20:32:05.8222743Z 2025-05-07T20:32:05.8222911Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8222986Z 2025-05-07T20:32:05.8223073Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8223194Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8223280Z x = x_sign * x_clamp 2025-05-07T20:32:05.8223359Z x0 = x[:, :D] 2025-05-07T20:32:05.8223441Z x1 = x[:, D:] 2025-05-07T20:32:05.8223509Z 2025-05-07T20:32:05.8223590Z if contiguous: 2025-05-07T20:32:05.8223683Z x0 = x0.contiguous() 2025-05-07T20:32:05.8223819Z x1 = x1.contiguous() 2025-05-07T20:32:05.8223888Z 2025-05-07T20:32:05.8223978Z if scale_ub is not None: 2025-05-07T20:32:05.8224082Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8224219Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8224290Z ) 2025-05-07T20:32:05.8224363Z else: 2025-05-07T20:32:05.8224456Z scale_ub_tensor = None 2025-05-07T20:32:05.8224527Z 2025-05-07T20:32:05.8224656Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8224748Z op = silu_mul_quant 2025-05-07T20:32:05.8224829Z if compiled: 2025-05-07T20:32:05.8224924Z op = torch.compile(op) 2025-05-07T20:32:05.8225028Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8225097Z 2025-05-07T20:32:05.8225186Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8225193Z 2025-05-07T20:32:05.8225285Z moe/activation_test.py:117: 2025-05-07T20:32:05.8225417Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8225516Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8225611Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8226105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8226202Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8226639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8226885Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8227249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8227342Z kernel = self.compile( 2025-05-07T20:32:05.8227718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8227895Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8228017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8228022Z 2025-05-07T20:32:05.8228230Z self = 2025-05-07T20:32:05.8228999Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8229497Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a92d9da0>} 2025-05-07T20:32:05.8230232Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8230465Z context = 2025-05-07T20:32:05.8230469Z 2025-05-07T20:32:05.8230629Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8230887Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8230994Z module_map=module_map) 2025-05-07T20:32:05.8231156Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8231253Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8231335Z E ^ 2025-05-07T20:32:05.8231682Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8231687Z 2025-05-07T20:32:05.8232100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8232147Z 2025-05-07T20:32:05.8232250Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8232470Z self=, 2025-05-07T20:32:05.8232551Z T=16384, 2025-05-07T20:32:05.8232628Z D=7168, 2025-05-07T20:32:05.8232711Z scale_ub=1200.0, 2025-05-07T20:32:05.8232796Z contiguous=False, 2025-05-07T20:32:05.8232878Z compiled=True, 2025-05-07T20:32:05.8232953Z ) 2025-05-07T20:32:05.8233171Z self = 2025-05-07T20:32:05.8233345Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:05.8233350Z 2025-05-07T20:32:05.8233428Z @given( 2025-05-07T20:32:05.8233541Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8233635Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8233748Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8233866Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8233978Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8234053Z ) 2025-05-07T20:32:05.8234297Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8234393Z def test_silu_mul_quant( 2025-05-07T20:32:05.8234469Z self, 2025-05-07T20:32:05.8234545Z T: int, 2025-05-07T20:32:05.8234623Z D: int, 2025-05-07T20:32:05.8234718Z scale_ub: Optional[float], 2025-05-07T20:32:05.8234891Z contiguous: bool, 2025-05-07T20:32:05.8234976Z compiled: bool, 2025-05-07T20:32:05.8235052Z ) -> None: 2025-05-07T20:32:05.8235151Z torch.manual_seed(2025) 2025-05-07T20:32:05.8235222Z 2025-05-07T20:32:05.8235390Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8235469Z 2025-05-07T20:32:05.8235557Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8235678Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8235774Z x = x_sign * x_clamp 2025-05-07T20:32:05.8235852Z x0 = x[:, :D] 2025-05-07T20:32:05.8235930Z x1 = x[:, D:] 2025-05-07T20:32:05.8236009Z 2025-05-07T20:32:05.8236090Z if contiguous: 2025-05-07T20:32:05.8236185Z x0 = x0.contiguous() 2025-05-07T20:32:05.8236273Z x1 = x1.contiguous() 2025-05-07T20:32:05.8236347Z 2025-05-07T20:32:05.8236442Z if scale_ub is not None: 2025-05-07T20:32:05.8236552Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8236684Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8236761Z ) 2025-05-07T20:32:05.8236838Z else: 2025-05-07T20:32:05.8236933Z scale_ub_tensor = None 2025-05-07T20:32:05.8237007Z 2025-05-07T20:32:05.8237133Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8237224Z op = silu_mul_quant 2025-05-07T20:32:05.8237360Z if compiled: 2025-05-07T20:32:05.8237460Z op = torch.compile(op) 2025-05-07T20:32:05.8237568Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8237641Z 2025-05-07T20:32:05.8237732Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8237736Z 2025-05-07T20:32:05.8237837Z moe/activation_test.py:117: 2025-05-07T20:32:05.8237967Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8238065Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8238173Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8238538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.8238629Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.8239123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8239219Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8239620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8239842Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8240181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8240281Z kernel = self.compile( 2025-05-07T20:32:05.8240661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8240835Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8240960Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8240965Z 2025-05-07T20:32:05.8241167Z self = 2025-05-07T20:32:05.8241939Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8242442Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9880a40>} 2025-05-07T20:32:05.8243351Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8243541Z context = 2025-05-07T20:32:05.8243546Z 2025-05-07T20:32:05.8243706Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8243969Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8244079Z module_map=module_map) 2025-05-07T20:32:05.8244247Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8244346Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8244425Z E ^ 2025-05-07T20:32:05.8244779Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8244784Z 2025-05-07T20:32:05.8245198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8245202Z 2025-05-07T20:32:05.8245310Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8245532Z self=, 2025-05-07T20:32:05.8245611Z T=1, 2025-05-07T20:32:05.8245695Z D=7168, 2025-05-07T20:32:05.8245779Z scale_ub=None, 2025-05-07T20:32:05.8245863Z contiguous=False, 2025-05-07T20:32:05.8245951Z compiled=False, 2025-05-07T20:32:05.8246066Z ) 2025-05-07T20:32:05.8246281Z self = 2025-05-07T20:32:05.8246451Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:05.8246456Z 2025-05-07T20:32:05.8246532Z @given( 2025-05-07T20:32:05.8246657Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8246757Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8246871Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8246993Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8247104Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8247183Z ) 2025-05-07T20:32:05.8247428Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8247572Z def test_silu_mul_quant( 2025-05-07T20:32:05.8247651Z self, 2025-05-07T20:32:05.8247733Z T: int, 2025-05-07T20:32:05.8247812Z D: int, 2025-05-07T20:32:05.8247957Z scale_ub: Optional[float], 2025-05-07T20:32:05.8248050Z contiguous: bool, 2025-05-07T20:32:05.8248134Z compiled: bool, 2025-05-07T20:32:05.8248216Z ) -> None: 2025-05-07T20:32:05.8248310Z torch.manual_seed(2025) 2025-05-07T20:32:05.8248384Z 2025-05-07T20:32:05.8248560Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8248634Z 2025-05-07T20:32:05.8248725Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8248861Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8248950Z x = x_sign * x_clamp 2025-05-07T20:32:05.8249026Z x0 = x[:, :D] 2025-05-07T20:32:05.8249111Z x1 = x[:, D:] 2025-05-07T20:32:05.8249185Z 2025-05-07T20:32:05.8249268Z if contiguous: 2025-05-07T20:32:05.8249363Z x0 = x0.contiguous() 2025-05-07T20:32:05.8249451Z x1 = x1.contiguous() 2025-05-07T20:32:05.8249521Z 2025-05-07T20:32:05.8249617Z if scale_ub is not None: 2025-05-07T20:32:05.8249725Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8249864Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8249940Z ) 2025-05-07T20:32:05.8250015Z else: 2025-05-07T20:32:05.8250115Z scale_ub_tensor = None 2025-05-07T20:32:05.8250188Z 2025-05-07T20:32:05.8250315Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8250411Z op = silu_mul_quant 2025-05-07T20:32:05.8250575Z if compiled: 2025-05-07T20:32:05.8250675Z op = torch.compile(op) 2025-05-07T20:32:05.8250785Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8250858Z 2025-05-07T20:32:05.8250948Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8250959Z 2025-05-07T20:32:05.8251057Z moe/activation_test.py:117: 2025-05-07T20:32:05.8251187Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8251298Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8251398Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8251892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8251993Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8252344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8252574Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8252912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8253006Z kernel = self.compile( 2025-05-07T20:32:05.8253386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8253558Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8253730Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8253734Z 2025-05-07T20:32:05.8253938Z self = 2025-05-07T20:32:05.8254706Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8255215Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a98818a0>} 2025-05-07T20:32:05.8255953Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8256144Z context = 2025-05-07T20:32:05.8256190Z 2025-05-07T20:32:05.8256352Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8256614Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8256730Z module_map=module_map) 2025-05-07T20:32:05.8256915Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8257032Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8257119Z E ^ 2025-05-07T20:32:05.8257471Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8257476Z 2025-05-07T20:32:05.8257888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8257893Z 2025-05-07T20:32:05.8257997Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8258222Z self=, 2025-05-07T20:32:05.8258306Z T=2048, 2025-05-07T20:32:05.8258385Z D=7168, 2025-05-07T20:32:05.8258469Z scale_ub=None, 2025-05-07T20:32:05.8258561Z contiguous=False, 2025-05-07T20:32:05.8258644Z compiled=True, 2025-05-07T20:32:05.8258724Z ) 2025-05-07T20:32:05.8258938Z self = 2025-05-07T20:32:05.8259185Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:05.8259191Z 2025-05-07T20:32:05.8259273Z @given( 2025-05-07T20:32:05.8259389Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8259487Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8259606Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8259721Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8259837Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8259914Z ) 2025-05-07T20:32:05.8260154Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8260253Z def test_silu_mul_quant( 2025-05-07T20:32:05.8260329Z self, 2025-05-07T20:32:05.8260405Z T: int, 2025-05-07T20:32:05.8260488Z D: int, 2025-05-07T20:32:05.8260586Z scale_ub: Optional[float], 2025-05-07T20:32:05.8260675Z contiguous: bool, 2025-05-07T20:32:05.8260767Z compiled: bool, 2025-05-07T20:32:05.8260852Z ) -> None: 2025-05-07T20:32:05.8260946Z torch.manual_seed(2025) 2025-05-07T20:32:05.8261024Z 2025-05-07T20:32:05.8261190Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8261264Z 2025-05-07T20:32:05.8261358Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8261479Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8261573Z x = x_sign * x_clamp 2025-05-07T20:32:05.8261651Z x0 = x[:, :D] 2025-05-07T20:32:05.8261780Z x1 = x[:, D:] 2025-05-07T20:32:05.8261862Z 2025-05-07T20:32:05.8261943Z if contiguous: 2025-05-07T20:32:05.8262032Z x0 = x0.contiguous() 2025-05-07T20:32:05.8262128Z x1 = x1.contiguous() 2025-05-07T20:32:05.8262196Z 2025-05-07T20:32:05.8262283Z if scale_ub is not None: 2025-05-07T20:32:05.8262392Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8262526Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8262606Z ) 2025-05-07T20:32:05.8262687Z else: 2025-05-07T20:32:05.8262779Z scale_ub_tensor = None 2025-05-07T20:32:05.8262856Z 2025-05-07T20:32:05.8262983Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8263072Z op = silu_mul_quant 2025-05-07T20:32:05.8263159Z if compiled: 2025-05-07T20:32:05.8263258Z op = torch.compile(op) 2025-05-07T20:32:05.8263359Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8263483Z 2025-05-07T20:32:05.8263572Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8263577Z 2025-05-07T20:32:05.8263672Z moe/activation_test.py:117: 2025-05-07T20:32:05.8263806Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8263906Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8264009Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8264381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.8264472Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.8264963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8265062Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8265412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8265643Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8265981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8266081Z kernel = self.compile( 2025-05-07T20:32:05.8266458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8266709Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8266841Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8266846Z 2025-05-07T20:32:05.8267046Z self = 2025-05-07T20:32:05.8267819Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8268322Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9882b60>} 2025-05-07T20:32:05.8269059Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8269259Z context = 2025-05-07T20:32:05.8269264Z 2025-05-07T20:32:05.8269425Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8269694Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8269798Z module_map=module_map) 2025-05-07T20:32:05.8269956Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8270106Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8270183Z E ^ 2025-05-07T20:32:05.8270532Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8270544Z 2025-05-07T20:32:05.8270952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8270957Z 2025-05-07T20:32:05.8271061Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8271288Z self=, 2025-05-07T20:32:05.8271368Z T=4096, 2025-05-07T20:32:05.8271446Z D=7168, 2025-05-07T20:32:05.8271533Z scale_ub=None, 2025-05-07T20:32:05.8271616Z contiguous=False, 2025-05-07T20:32:05.8271701Z compiled=True, 2025-05-07T20:32:05.8271776Z ) 2025-05-07T20:32:05.8271990Z self = 2025-05-07T20:32:05.8272233Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:05.8272238Z 2025-05-07T20:32:05.8272314Z @given( 2025-05-07T20:32:05.8272434Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8272538Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8272652Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8272771Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8272886Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8272966Z ) 2025-05-07T20:32:05.8273213Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8273308Z def test_silu_mul_quant( 2025-05-07T20:32:05.8273382Z self, 2025-05-07T20:32:05.8273464Z T: int, 2025-05-07T20:32:05.8273541Z D: int, 2025-05-07T20:32:05.8273639Z scale_ub: Optional[float], 2025-05-07T20:32:05.8273735Z contiguous: bool, 2025-05-07T20:32:05.8273828Z compiled: bool, 2025-05-07T20:32:05.8273906Z ) -> None: 2025-05-07T20:32:05.8274007Z torch.manual_seed(2025) 2025-05-07T20:32:05.8274081Z 2025-05-07T20:32:05.8274248Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8274329Z 2025-05-07T20:32:05.8274420Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8274543Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8274637Z x = x_sign * x_clamp 2025-05-07T20:32:05.8274799Z x0 = x[:, :D] 2025-05-07T20:32:05.8274887Z x1 = x[:, D:] 2025-05-07T20:32:05.8274959Z 2025-05-07T20:32:05.8275042Z if contiguous: 2025-05-07T20:32:05.8275137Z x0 = x0.contiguous() 2025-05-07T20:32:05.8275222Z x1 = x1.contiguous() 2025-05-07T20:32:05.8275291Z 2025-05-07T20:32:05.8275385Z if scale_ub is not None: 2025-05-07T20:32:05.8275489Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8275625Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8275709Z ) 2025-05-07T20:32:05.8275786Z else: 2025-05-07T20:32:05.8275878Z scale_ub_tensor = None 2025-05-07T20:32:05.8275957Z 2025-05-07T20:32:05.8276084Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8276179Z op = silu_mul_quant 2025-05-07T20:32:05.8276265Z if compiled: 2025-05-07T20:32:05.8276367Z op = torch.compile(op) 2025-05-07T20:32:05.8276479Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8276551Z 2025-05-07T20:32:05.8276641Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8276645Z 2025-05-07T20:32:05.8276748Z moe/activation_test.py:117: 2025-05-07T20:32:05.8276878Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8276976Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8277076Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8277491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.8277588Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.8278077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8278171Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8278534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8278751Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8279086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8279187Z kernel = self.compile( 2025-05-07T20:32:05.8279563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8279783Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8279907Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8279911Z 2025-05-07T20:32:05.8280113Z self = 2025-05-07T20:32:05.8280890Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8281387Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a9883e20>} 2025-05-07T20:32:05.8282130Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8282321Z context = 2025-05-07T20:32:05.8282326Z 2025-05-07T20:32:05.8282491Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8282751Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8282859Z module_map=module_map) 2025-05-07T20:32:05.8283026Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8283199Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8283280Z E ^ 2025-05-07T20:32:05.8283634Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8283639Z 2025-05-07T20:32:05.8284047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8284052Z 2025-05-07T20:32:05.8284163Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8284381Z self=, 2025-05-07T20:32:05.8284464Z T=16384, 2025-05-07T20:32:05.8284547Z D=5120, 2025-05-07T20:32:05.8284631Z scale_ub=1200.0, 2025-05-07T20:32:05.8284714Z contiguous=False, 2025-05-07T20:32:05.8284807Z compiled=False, 2025-05-07T20:32:05.8284882Z ) 2025-05-07T20:32:05.8285095Z self = 2025-05-07T20:32:05.8285282Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:05.8285287Z 2025-05-07T20:32:05.8285363Z @given( 2025-05-07T20:32:05.8285485Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8285582Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8285696Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8285818Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8285974Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8286049Z ) 2025-05-07T20:32:05.8286295Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8286387Z def test_silu_mul_quant( 2025-05-07T20:32:05.8286470Z self, 2025-05-07T20:32:05.8286545Z T: int, 2025-05-07T20:32:05.8286622Z D: int, 2025-05-07T20:32:05.8286720Z scale_ub: Optional[float], 2025-05-07T20:32:05.8286806Z contiguous: bool, 2025-05-07T20:32:05.8286893Z compiled: bool, 2025-05-07T20:32:05.8286975Z ) -> None: 2025-05-07T20:32:05.8287069Z torch.manual_seed(2025) 2025-05-07T20:32:05.8287139Z 2025-05-07T20:32:05.8287310Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8287384Z 2025-05-07T20:32:05.8287472Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8287648Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8287783Z x = x_sign * x_clamp 2025-05-07T20:32:05.8287859Z x0 = x[:, :D] 2025-05-07T20:32:05.8287943Z x1 = x[:, D:] 2025-05-07T20:32:05.8288014Z 2025-05-07T20:32:05.8288103Z if contiguous: 2025-05-07T20:32:05.8288193Z x0 = x0.contiguous() 2025-05-07T20:32:05.8288283Z x1 = x1.contiguous() 2025-05-07T20:32:05.8288361Z 2025-05-07T20:32:05.8288449Z if scale_ub is not None: 2025-05-07T20:32:05.8288553Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8288693Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8288767Z ) 2025-05-07T20:32:05.8288842Z else: 2025-05-07T20:32:05.8288940Z scale_ub_tensor = None 2025-05-07T20:32:05.8289010Z 2025-05-07T20:32:05.8289138Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8289232Z op = silu_mul_quant 2025-05-07T20:32:05.8289316Z if compiled: 2025-05-07T20:32:05.8289421Z op = torch.compile(op) 2025-05-07T20:32:05.8289531Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8289601Z 2025-05-07T20:32:05.8289697Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8289701Z 2025-05-07T20:32:05.8289796Z moe/activation_test.py:117: 2025-05-07T20:32:05.8289926Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8290032Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8290128Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8290701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:05.8290802Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8291157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8291377Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8291719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8291813Z kernel = self.compile( 2025-05-07T20:32:05.8295745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8295939Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8296067Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8296080Z 2025-05-07T20:32:05.8296286Z self = 2025-05-07T20:32:05.8297064Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8297569Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a941cd60>} 2025-05-07T20:32:05.8298385Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8298576Z context = 2025-05-07T20:32:05.8298581Z 2025-05-07T20:32:05.8298752Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8299014Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8299121Z module_map=module_map) 2025-05-07T20:32:05.8299284Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8299384Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8299461Z E ^ 2025-05-07T20:32:05.8299818Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8299864Z 2025-05-07T20:32:05.8300278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8300282Z 2025-05-07T20:32:05.8300389Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8300609Z self=, 2025-05-07T20:32:05.8300691Z T=16384, 2025-05-07T20:32:05.8300772Z D=5120, 2025-05-07T20:32:05.8300857Z scale_ub=1200.0, 2025-05-07T20:32:05.8300945Z contiguous=True, 2025-05-07T20:32:05.8301032Z compiled=True, 2025-05-07T20:32:05.8301107Z ) 2025-05-07T20:32:05.8301323Z self = 2025-05-07T20:32:05.8301501Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.8301505Z 2025-05-07T20:32:05.8301587Z @given( 2025-05-07T20:32:05.8301708Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8301807Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8301921Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8302039Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8302154Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8302231Z ) 2025-05-07T20:32:05.8302553Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8302649Z def test_silu_mul_quant( 2025-05-07T20:32:05.8302730Z self, 2025-05-07T20:32:05.8302809Z T: int, 2025-05-07T20:32:05.8302887Z D: int, 2025-05-07T20:32:05.8302991Z scale_ub: Optional[float], 2025-05-07T20:32:05.8303085Z contiguous: bool, 2025-05-07T20:32:05.8303171Z compiled: bool, 2025-05-07T20:32:05.8303254Z ) -> None: 2025-05-07T20:32:05.8303349Z torch.manual_seed(2025) 2025-05-07T20:32:05.8303428Z 2025-05-07T20:32:05.8303597Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8303671Z 2025-05-07T20:32:05.8303763Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8303890Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8303979Z x = x_sign * x_clamp 2025-05-07T20:32:05.8304061Z x0 = x[:, :D] 2025-05-07T20:32:05.8304146Z x1 = x[:, D:] 2025-05-07T20:32:05.8304220Z 2025-05-07T20:32:05.8304315Z if contiguous: 2025-05-07T20:32:05.8304408Z x0 = x0.contiguous() 2025-05-07T20:32:05.8304501Z x1 = x1.contiguous() 2025-05-07T20:32:05.8304577Z 2025-05-07T20:32:05.8304667Z if scale_ub is not None: 2025-05-07T20:32:05.8304773Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8304913Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8304989Z ) 2025-05-07T20:32:05.8305138Z else: 2025-05-07T20:32:05.8305235Z scale_ub_tensor = None 2025-05-07T20:32:05.8305307Z 2025-05-07T20:32:05.8305435Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8305529Z op = silu_mul_quant 2025-05-07T20:32:05.8305947Z if compiled: 2025-05-07T20:32:05.8306094Z op = torch.compile(op) 2025-05-07T20:32:05.8306201Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8306272Z 2025-05-07T20:32:05.8306371Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8306376Z 2025-05-07T20:32:05.8306469Z moe/activation_test.py:117: 2025-05-07T20:32:05.8306596Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8306697Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8306793Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8307154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.8307348Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.8307836Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x…>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function … at 0x…>, 'min_dot_size': <function min_dot_size.<locals>.<lambda> at 0x7f39a941e200>}
module_map = {'triton.language.extra.libdevice': <module …>}
context = <…>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:05.8314603Z Trying example: test_silu_mul_quant(
    self=<…>,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)

self = <…>
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(… the Triton compile frames and CompilationError are identical to the traceback above …)
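For orientation while reading these tracebacks: judging by the test body and the kernel name, silu_mul_quant computes a SiLU-gated product, silu(x0) * x1, and quantizes the result to FP8, returning the quantized tensor and its scale. A minimal eager-mode sketch follows, assuming per-tensor max-based scaling and torch.float8_e4m3fn output (the dtype Triton calls fp8e4nv); the name silu_mul_quant_ref and the exact scaling recipe are illustrative assumptions, not FBGEMM's implementation.

# Hedged sketch of the believed semantics; not FBGEMM's actual kernel.
from typing import Optional, Tuple

import torch

FP8_DTYPE = torch.float8_e4m3fn  # Triton's "fp8e4nv"
FP8_MAX = torch.finfo(FP8_DTYPE).max


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Gated activation, computed in fp32 for accuracy.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # Per-tensor scale from the observed max, optionally capped by scale_ub
    # (assumed meaning of the scale_ub_tensor argument in the test).
    amax = y.abs().amax()
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub.to(amax.dtype))
    scale = (amax / FP8_MAX).clamp(min=1e-12)
    y_fp8 = (y / scale).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE)
    return y_fp8, scale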
2025-05-07T20:32:05.8327450Z Hypothesis keeps generating examples, and every one fails the same way: the Triton compile of _fbgemm_silu_mul_quant raises CompilationError / ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") from triton/compiler/compiler.py:100 before the kernel ever launches. The printed test source and traceback are identical to the ones above for each example (for compiled=False examples the torch/_dynamo/eval_frame.py frame is simply absent, so the failure is independent of torch.compile). Rather than repeat them verbatim, the examples tried are summarized here:

Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False) -> CompilationError
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> CompilationError
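Root cause of all the CompilationErrors above: Triton's fp8e4nv is the FP8 e4m3 format (torch.float8_e4m3fn), which Triton only enables on NVIDIA GPUs with compute capability 8.9 (Ada) / 9.0 (Hopper) or newer; on older parts it advertises exactly the two fallback formats named in the error, fp8e4b15 and fp8e5. So this is an environment/test-selection problem, not a kernel bug: the GPU on this runner predates native FP8 e4m3 support. A guard along the following lines would skip these cases instead of failing them. This is a sketch only; fp8_e4m3_supported and the decorator placement are hypothetical, not FBGEMM's existing test plumbing.

import unittest

import torch


def fp8_e4m3_supported() -> bool:
    # fp8e4nv (float8_e4m3fn) needs SM 8.9+; older GPUs only get
    # fp8e4b15 / fp8e5, which is exactly the ValueError seen above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical placement: guard the whole test class (or individual tests).
@unittest.skipIf(not fp8_e4m3_supported(), "FP8 e4m3 requires SM 8.9+")
class ActivationTests(unittest.TestCase):
    ...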
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8464006Z 2025-05-07T20:32:05.8464411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8464423Z 2025-05-07T20:32:05.8464524Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8464742Z self=, 2025-05-07T20:32:05.8464827Z T=16384, 2025-05-07T20:32:05.8464950Z D=5120, 2025-05-07T20:32:05.8465035Z scale_ub=None, 2025-05-07T20:32:05.8465119Z contiguous=False, 2025-05-07T20:32:05.8465203Z compiled=False, 2025-05-07T20:32:05.8465275Z ) 2025-05-07T20:32:05.8465491Z self = 2025-05-07T20:32:05.8465664Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:05.8465711Z 2025-05-07T20:32:05.8465786Z @given( 2025-05-07T20:32:05.8465904Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8466003Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8466119Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8466233Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8466344Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8466423Z ) 2025-05-07T20:32:05.8466668Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8466760Z def test_silu_mul_quant( 2025-05-07T20:32:05.8466842Z self, 2025-05-07T20:32:05.8466914Z T: int, 2025-05-07T20:32:05.8466987Z D: int, 2025-05-07T20:32:05.8467088Z scale_ub: Optional[float], 2025-05-07T20:32:05.8467176Z contiguous: bool, 2025-05-07T20:32:05.8467265Z compiled: bool, 2025-05-07T20:32:05.8467343Z ) -> None: 2025-05-07T20:32:05.8467438Z torch.manual_seed(2025) 2025-05-07T20:32:05.8467558Z 2025-05-07T20:32:05.8467724Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8467797Z 2025-05-07T20:32:05.8467892Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8468014Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8469819Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8469825Z 2025-05-07T20:32:05.8469939Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:05.8469949Z 2025-05-07T20:32:05.8470050Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8470270Z self=, 2025-05-07T20:32:05.8470349Z T=4096, 2025-05-07T20:32:05.8470427Z D=7168, 2025-05-07T20:32:05.8470510Z scale_ub=1200.0, 2025-05-07T20:32:05.8470594Z contiguous=True, 2025-05-07T20:32:05.8470677Z compiled=True, 2025-05-07T20:32:05.8470750Z ) 2025-05-07T20:32:05.8471008Z self = 2025-05-07T20:32:05.8471180Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.8471185Z 2025-05-07T20:32:05.8471261Z @given( 2025-05-07T20:32:05.8471375Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8471477Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8471589Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8471713Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8471823Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8471896Z ) 2025-05-07T20:32:05.8472138Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8472230Z def test_silu_mul_quant( 2025-05-07T20:32:05.8472305Z self, 2025-05-07T20:32:05.8472386Z T: int, 2025-05-07T20:32:05.8472461Z D: int, 2025-05-07T20:32:05.8472559Z scale_ub: Optional[float], 2025-05-07T20:32:05.8472696Z contiguous: bool, 2025-05-07T20:32:05.8472781Z compiled: bool, 2025-05-07T20:32:05.8472859Z ) -> None: 2025-05-07T20:32:05.8472954Z torch.manual_seed(2025) 2025-05-07T20:32:05.8473027Z 2025-05-07T20:32:05.8473195Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8473268Z 2025-05-07T20:32:05.8473363Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8473527Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8475311Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8475317Z 2025-05-07T20:32:05.8475436Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:05.8475441Z 2025-05-07T20:32:05.8475542Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8475757Z self=, 2025-05-07T20:32:05.8475836Z T=16384, 2025-05-07T20:32:05.8475954Z D=7168, 2025-05-07T20:32:05.8476035Z scale_ub=None, 2025-05-07T20:32:05.8476120Z contiguous=False, 2025-05-07T20:32:05.8476201Z compiled=False, 2025-05-07T20:32:05.8476276Z ) 2025-05-07T20:32:05.8476489Z self = 2025-05-07T20:32:05.8476658Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:05.8476662Z 2025-05-07T20:32:05.8476739Z @given( 2025-05-07T20:32:05.8476855Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8476953Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8477069Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8477182Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8477291Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8477368Z ) 2025-05-07T20:32:05.8477606Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8477705Z def test_silu_mul_quant( 2025-05-07T20:32:05.8477782Z self, 2025-05-07T20:32:05.8477859Z T: int, 2025-05-07T20:32:05.8477935Z D: int, 2025-05-07T20:32:05.8478032Z scale_ub: Optional[float], 2025-05-07T20:32:05.8478118Z contiguous: bool, 2025-05-07T20:32:05.8478204Z compiled: bool, 2025-05-07T20:32:05.8478280Z ) -> None: 2025-05-07T20:32:05.8478368Z torch.manual_seed(2025) 2025-05-07T20:32:05.8478441Z 2025-05-07T20:32:05.8478649Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8480432Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8480443Z 2025-05-07T20:32:05.8480558Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8480562Z 2025-05-07T20:32:05.8480668Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8480883Z self=, 2025-05-07T20:32:05.8480961Z T=2048, 2025-05-07T20:32:05.8481040Z D=7168, 2025-05-07T20:32:05.8481167Z scale_ub=1200.0, 2025-05-07T20:32:05.8481252Z contiguous=True, 2025-05-07T20:32:05.8481337Z compiled=True, 2025-05-07T20:32:05.8481410Z ) 2025-05-07T20:32:05.8481624Z self = 2025-05-07T20:32:05.8481794Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.8481799Z 2025-05-07T20:32:05.8481916Z @given( 2025-05-07T20:32:05.8482035Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8482131Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8482243Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8482358Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8482466Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8482538Z ) 2025-05-07T20:32:05.8482783Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8482876Z def test_silu_mul_quant( 2025-05-07T20:32:05.8482958Z self, 2025-05-07T20:32:05.8483036Z T: int, 2025-05-07T20:32:05.8483117Z D: int, 2025-05-07T20:32:05.8483220Z scale_ub: Optional[float], 2025-05-07T20:32:05.8483307Z contiguous: bool, 2025-05-07T20:32:05.8483392Z compiled: bool, 2025-05-07T20:32:05.8483480Z ) -> None: 2025-05-07T20:32:05.8483576Z torch.manual_seed(2025) 2025-05-07T20:32:05.8483697Z 2025-05-07T20:32:05.8483868Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8483940Z 2025-05-07T20:32:05.8484035Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8484157Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8485921Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8485933Z 2025-05-07T20:32:05.8486047Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:05.8486056Z 2025-05-07T20:32:05.8486158Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8486381Z self=, 2025-05-07T20:32:05.8486457Z T=2048, 2025-05-07T20:32:05.8486529Z D=7168, 2025-05-07T20:32:05.8486614Z scale_ub=None, 2025-05-07T20:32:05.8486694Z contiguous=True, 2025-05-07T20:32:05.8486775Z compiled=False, 2025-05-07T20:32:05.8486854Z ) 2025-05-07T20:32:05.8487109Z self = 2025-05-07T20:32:05.8487287Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:05.8487292Z 2025-05-07T20:32:05.8487368Z @given( 2025-05-07T20:32:05.8487486Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8487637Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8487750Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8487863Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8487987Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8488059Z ) 2025-05-07T20:32:05.8488301Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8488400Z def test_silu_mul_quant( 2025-05-07T20:32:05.8488473Z self, 2025-05-07T20:32:05.8488554Z T: int, 2025-05-07T20:32:05.8488633Z D: int, 2025-05-07T20:32:05.8488728Z scale_ub: Optional[float], 2025-05-07T20:32:05.8488823Z contiguous: bool, 2025-05-07T20:32:05.8488955Z compiled: bool, 2025-05-07T20:32:05.8489036Z ) -> None: 2025-05-07T20:32:05.8489133Z torch.manual_seed(2025) 2025-05-07T20:32:05.8489202Z 2025-05-07T20:32:05.8489367Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8489444Z 2025-05-07T20:32:05.8489533Z > x_sign = torch.sign(x) 2025-05-07T20:32:05.8491293Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8491345Z 2025-05-07T20:32:05.8491464Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:05.8491468Z 2025-05-07T20:32:05.8491576Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8491794Z self=, 2025-05-07T20:32:05.8491871Z T=1, 2025-05-07T20:32:05.8491949Z D=7168, 2025-05-07T20:32:05.8492030Z scale_ub=1200.0, 2025-05-07T20:32:05.8492112Z contiguous=True, 2025-05-07T20:32:05.8492267Z compiled=False, 2025-05-07T20:32:05.8492341Z ) 2025-05-07T20:32:05.8492554Z self = 2025-05-07T20:32:05.8492721Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.8492726Z 2025-05-07T20:32:05.8492800Z @given( 2025-05-07T20:32:05.8492920Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8493019Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8493137Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8493256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8493367Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8493441Z ) 2025-05-07T20:32:05.8493687Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8493783Z def test_silu_mul_quant( 2025-05-07T20:32:05.8493862Z self, 2025-05-07T20:32:05.8493949Z T: int, 2025-05-07T20:32:05.8494025Z D: int, 2025-05-07T20:32:05.8494118Z scale_ub: Optional[float], 2025-05-07T20:32:05.8494214Z contiguous: bool, 2025-05-07T20:32:05.8494297Z compiled: bool, 2025-05-07T20:32:05.8494377Z ) -> None: 2025-05-07T20:32:05.8494470Z torch.manual_seed(2025) 2025-05-07T20:32:05.8494545Z 2025-05-07T20:32:05.8494711Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8494785Z 2025-05-07T20:32:05.8494924Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8495051Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8495139Z x = x_sign * x_clamp 2025-05-07T20:32:05.8495218Z x0 = x[:, :D] 2025-05-07T20:32:05.8495300Z x1 = x[:, D:] 2025-05-07T20:32:05.8495374Z 2025-05-07T20:32:05.8495459Z if contiguous: 2025-05-07T20:32:05.8495554Z x0 = x0.contiguous() 2025-05-07T20:32:05.8495645Z x1 = x1.contiguous() 2025-05-07T20:32:05.8495723Z 2025-05-07T20:32:05.8495820Z if scale_ub is not None: 2025-05-07T20:32:05.8495925Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8496060Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8496135Z ) 2025-05-07T20:32:05.8496213Z else: 2025-05-07T20:32:05.8496315Z scale_ub_tensor = None 2025-05-07T20:32:05.8496388Z 2025-05-07T20:32:05.8496515Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8496661Z op = silu_mul_quant 2025-05-07T20:32:05.8496749Z if compiled: 2025-05-07T20:32:05.8496846Z op = torch.compile(op) 2025-05-07T20:32:05.8496957Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8497029Z 2025-05-07T20:32:05.8497118Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8497129Z 2025-05-07T20:32:05.8497224Z moe/activation_test.py:117: 2025-05-07T20:32:05.8497354Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8497496Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8497593Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8498090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8498192Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8498551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8498779Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8499117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8499213Z kernel = self.compile( 2025-05-07T20:32:05.8499600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8499817Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8499943Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8499948Z 2025-05-07T20:32:05.8500156Z self = 2025-05-07T20:32:05.8500931Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8501434Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a8a01440>} 2025-05-07T20:32:05.8502177Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8502377Z context = 2025-05-07T20:32:05.8502381Z 2025-05-07T20:32:05.8502544Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8502806Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8502920Z module_map=module_map) 2025-05-07T20:32:05.8503119Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8503220Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8503304Z E ^ 2025-05-07T20:32:05.8503658Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8503663Z 2025-05-07T20:32:05.8504078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8504085Z 2025-05-07T20:32:05.8504188Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8504408Z self=, 2025-05-07T20:32:05.8504494Z T=128, 2025-05-07T20:32:05.8504571Z D=5120, 2025-05-07T20:32:05.8504652Z scale_ub=None, 2025-05-07T20:32:05.8504743Z contiguous=True, 2025-05-07T20:32:05.8504828Z compiled=False, 2025-05-07T20:32:05.8504905Z ) 2025-05-07T20:32:05.8505118Z self = 2025-05-07T20:32:05.8505404Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:05.8505409Z 2025-05-07T20:32:05.8505493Z @given( 2025-05-07T20:32:05.8505903Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8506039Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8506160Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8506274Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8506474Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8506552Z ) 2025-05-07T20:32:05.8506794Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8506892Z def test_silu_mul_quant( 2025-05-07T20:32:05.8506967Z self, 2025-05-07T20:32:05.8507043Z T: int, 2025-05-07T20:32:05.8507121Z D: int, 2025-05-07T20:32:05.8507219Z scale_ub: Optional[float], 2025-05-07T20:32:05.8507308Z contiguous: bool, 2025-05-07T20:32:05.8507403Z compiled: bool, 2025-05-07T20:32:05.8507482Z ) -> None: 2025-05-07T20:32:05.8507574Z torch.manual_seed(2025) 2025-05-07T20:32:05.8507648Z 2025-05-07T20:32:05.8507812Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8507885Z 2025-05-07T20:32:05.8507980Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8508101Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8508262Z x = x_sign * x_clamp 2025-05-07T20:32:05.8508341Z x0 = x[:, :D] 2025-05-07T20:32:05.8508421Z x1 = x[:, D:] 2025-05-07T20:32:05.8508499Z 2025-05-07T20:32:05.8508581Z if contiguous: 2025-05-07T20:32:05.8508671Z x0 = x0.contiguous() 2025-05-07T20:32:05.8508763Z x1 = x1.contiguous() 2025-05-07T20:32:05.8508834Z 2025-05-07T20:32:05.8508924Z if scale_ub is not None: 2025-05-07T20:32:05.8509036Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8509173Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8509249Z ) 2025-05-07T20:32:05.8509333Z else: 2025-05-07T20:32:05.8509426Z scale_ub_tensor = None 2025-05-07T20:32:05.8509505Z 2025-05-07T20:32:05.8509632Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8509720Z op = silu_mul_quant 2025-05-07T20:32:05.8509808Z if compiled: 2025-05-07T20:32:05.8509913Z op = torch.compile(op) 2025-05-07T20:32:05.8510020Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8510100Z 2025-05-07T20:32:05.8510187Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8510192Z 2025-05-07T20:32:05.8510289Z moe/activation_test.py:117: 2025-05-07T20:32:05.8510421Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8510520Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8510617Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8511183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8511283Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8511642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8511859Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8512202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8512300Z kernel = self.compile( 2025-05-07T20:32:05.8512677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8512856Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8512982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8512990Z 2025-05-07T20:32:05.8513258Z self = 2025-05-07T20:32:05.8514038Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8514536Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a8a02660>} 2025-05-07T20:32:05.8515329Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8515517Z context = 2025-05-07T20:32:05.8515521Z 2025-05-07T20:32:05.8515687Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8515952Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8516058Z module_map=module_map) 2025-05-07T20:32:05.8516222Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8516320Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8516396Z E ^ 2025-05-07T20:32:05.8516794Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8516799Z 2025-05-07T20:32:05.8517256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8517261Z 2025-05-07T20:32:05.8517368Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8517587Z self=, 2025-05-07T20:32:05.8517667Z T=128, 2025-05-07T20:32:05.8517754Z D=7168, 2025-05-07T20:32:05.8517836Z scale_ub=None, 2025-05-07T20:32:05.8517920Z contiguous=True, 2025-05-07T20:32:05.8518008Z compiled=False, 2025-05-07T20:32:05.8518081Z ) 2025-05-07T20:32:05.8518296Z self = 2025-05-07T20:32:05.8518468Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:05.8518473Z 2025-05-07T20:32:05.8518552Z @given( 2025-05-07T20:32:05.8518674Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8518771Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8518882Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8519001Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8519117Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8519189Z ) 2025-05-07T20:32:05.8519476Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8519573Z def test_silu_mul_quant( 2025-05-07T20:32:05.8519646Z self, 2025-05-07T20:32:05.8519722Z T: int, 2025-05-07T20:32:05.8519797Z D: int, 2025-05-07T20:32:05.8519899Z scale_ub: Optional[float], 2025-05-07T20:32:05.8519986Z contiguous: bool, 2025-05-07T20:32:05.8520071Z compiled: bool, 2025-05-07T20:32:05.8520151Z ) -> None: 2025-05-07T20:32:05.8520243Z torch.manual_seed(2025) 2025-05-07T20:32:05.8520319Z 2025-05-07T20:32:05.8520490Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8520563Z 2025-05-07T20:32:05.8520652Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8520781Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8520868Z x = x_sign * x_clamp 2025-05-07T20:32:05.8520949Z x0 = x[:, :D] 2025-05-07T20:32:05.8521033Z x1 = x[:, D:] 2025-05-07T20:32:05.8521103Z 2025-05-07T20:32:05.8521193Z if contiguous: 2025-05-07T20:32:05.8521343Z x0 = x0.contiguous() 2025-05-07T20:32:05.8521434Z x1 = x1.contiguous() 2025-05-07T20:32:05.8521511Z 2025-05-07T20:32:05.8521604Z if scale_ub is not None: 2025-05-07T20:32:05.8521709Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8521848Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8521923Z ) 2025-05-07T20:32:05.8522062Z else: 2025-05-07T20:32:05.8522165Z scale_ub_tensor = None 2025-05-07T20:32:05.8522237Z 2025-05-07T20:32:05.8522364Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8522460Z op = silu_mul_quant 2025-05-07T20:32:05.8522543Z if compiled: 2025-05-07T20:32:05.8522640Z op = torch.compile(op) 2025-05-07T20:32:05.8522751Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8522824Z 2025-05-07T20:32:05.8522918Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8522925Z 2025-05-07T20:32:05.8523022Z moe/activation_test.py:117: 2025-05-07T20:32:05.8523150Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8523254Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8523347Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8523841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8523985Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8524338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8524561Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8524900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8524996Z kernel = self.compile( 2025-05-07T20:32:05.8525382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8525554Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8525680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8525690Z 2025-05-07T20:32:05.8525891Z self = 2025-05-07T20:32:05.8526667Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8527175Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a8a036a0>} 2025-05-07T20:32:05.8528017Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8528212Z context = 2025-05-07T20:32:05.8528216Z 2025-05-07T20:32:05.8528377Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8528636Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8528752Z module_map=module_map) 2025-05-07T20:32:05.8528910Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8529013Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8529089Z E ^ 2025-05-07T20:32:05.8529440Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8529445Z 2025-05-07T20:32:05.8529905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8529910Z 2025-05-07T20:32:05.8530014Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8530233Z self=, 2025-05-07T20:32:05.8530318Z T=2048, 2025-05-07T20:32:05.8530393Z D=7168, 2025-05-07T20:32:05.8530479Z scale_ub=1200.0, 2025-05-07T20:32:05.8530563Z contiguous=True, 2025-05-07T20:32:05.8530689Z compiled=False, 2025-05-07T20:32:05.8530769Z ) 2025-05-07T20:32:05.8530986Z self = 2025-05-07T20:32:05.8531158Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.8531163Z 2025-05-07T20:32:05.8531244Z @given( 2025-05-07T20:32:05.8531361Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8531460Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8531579Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8531697Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8531813Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8531889Z ) 2025-05-07T20:32:05.8532130Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8532225Z def test_silu_mul_quant( 2025-05-07T20:32:05.8532298Z self, 2025-05-07T20:32:05.8532418Z T: int, 2025-05-07T20:32:05.8532498Z D: int, 2025-05-07T20:32:05.8532595Z scale_ub: Optional[float], 2025-05-07T20:32:05.8532684Z contiguous: bool, 2025-05-07T20:32:05.8532770Z compiled: bool, 2025-05-07T20:32:05.8532847Z ) -> None: 2025-05-07T20:32:05.8532938Z torch.manual_seed(2025) 2025-05-07T20:32:05.8533012Z 2025-05-07T20:32:05.8533179Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8534964Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8534975Z 2025-05-07T20:32:05.8535092Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8535096Z 2025-05-07T20:32:05.8535198Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8535416Z self=, 2025-05-07T20:32:05.8535490Z T=1, 2025-05-07T20:32:05.8535572Z D=5120, 2025-05-07T20:32:05.8535652Z scale_ub=1200.0, 2025-05-07T20:32:05.8535777Z contiguous=True, 2025-05-07T20:32:05.8535869Z compiled=False, 2025-05-07T20:32:05.8535938Z ) 2025-05-07T20:32:05.8536152Z self = 2025-05-07T20:32:05.8536318Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.8536323Z 2025-05-07T20:32:05.8536400Z @given( 2025-05-07T20:32:05.8536521Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8536624Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8536735Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8536881Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8537004Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8537090Z ) 2025-05-07T20:32:05.8537335Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8537428Z def test_silu_mul_quant( 2025-05-07T20:32:05.8537499Z self, 2025-05-07T20:32:05.8537582Z T: int, 2025-05-07T20:32:05.8537698Z D: int, 2025-05-07T20:32:05.8537797Z scale_ub: Optional[float], 2025-05-07T20:32:05.8537886Z contiguous: bool, 2025-05-07T20:32:05.8537967Z compiled: bool, 2025-05-07T20:32:05.8538047Z ) -> None: 2025-05-07T20:32:05.8538141Z torch.manual_seed(2025) 2025-05-07T20:32:05.8538211Z 2025-05-07T20:32:05.8538381Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8538499Z 2025-05-07T20:32:05.8538588Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8538718Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8538805Z x = x_sign * x_clamp 2025-05-07T20:32:05.8538884Z x0 = x[:, :D] 2025-05-07T20:32:05.8538969Z x1 = x[:, D:] 2025-05-07T20:32:05.8539041Z 2025-05-07T20:32:05.8539127Z if contiguous: 2025-05-07T20:32:05.8539217Z x0 = x0.contiguous() 2025-05-07T20:32:05.8539307Z x1 = x1.contiguous() 2025-05-07T20:32:05.8539385Z 2025-05-07T20:32:05.8539474Z if scale_ub is not None: 2025-05-07T20:32:05.8539578Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8539715Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8539791Z ) 2025-05-07T20:32:05.8539868Z else: 2025-05-07T20:32:05.8539967Z scale_ub_tensor = None 2025-05-07T20:32:05.8540040Z 2025-05-07T20:32:05.8540212Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8540301Z op = silu_mul_quant 2025-05-07T20:32:05.8540383Z if compiled: 2025-05-07T20:32:05.8540482Z op = torch.compile(op) 2025-05-07T20:32:05.8540587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8540659Z 2025-05-07T20:32:05.8540758Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8540763Z 2025-05-07T20:32:05.8540854Z moe/activation_test.py:117: 2025-05-07T20:32:05.8544582Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8544703Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8544802Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8545312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8545410Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8545768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8545999Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8546340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8546439Z kernel = self.compile( 2025-05-07T20:32:05.8546884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8547062Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8547192Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8547197Z 2025-05-07T20:32:05.8547399Z self = 2025-05-07T20:32:05.8548172Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8548685Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a8c90b80>} 2025-05-07T20:32:05.8549475Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8549668Z context = 2025-05-07T20:32:05.8549673Z 2025-05-07T20:32:05.8549835Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8550100Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8550206Z module_map=module_map) 2025-05-07T20:32:05.8550409Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8550510Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8550588Z E ^ 2025-05-07T20:32:05.8550939Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8550943Z 2025-05-07T20:32:05.8551358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8551362Z 2025-05-07T20:32:05.8551470Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8551691Z self=, 2025-05-07T20:32:05.8551769Z T=2048, 2025-05-07T20:32:05.8551843Z D=5120, 2025-05-07T20:32:05.8551931Z scale_ub=None, 2025-05-07T20:32:05.8552015Z contiguous=True, 2025-05-07T20:32:05.8552099Z compiled=False, 2025-05-07T20:32:05.8552176Z ) 2025-05-07T20:32:05.8552392Z self = 2025-05-07T20:32:05.8552626Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:05.8552630Z 2025-05-07T20:32:05.8552707Z @given( 2025-05-07T20:32:05.8552826Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8552929Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8553043Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8553159Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8553276Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8553350Z ) 2025-05-07T20:32:05.8553595Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8553694Z def test_silu_mul_quant( 2025-05-07T20:32:05.8553770Z self, 2025-05-07T20:32:05.8553849Z T: int, 2025-05-07T20:32:05.8553925Z D: int, 2025-05-07T20:32:05.8554023Z scale_ub: Optional[float], 2025-05-07T20:32:05.8554122Z contiguous: bool, 2025-05-07T20:32:05.8554207Z compiled: bool, 2025-05-07T20:32:05.8554286Z ) -> None: 2025-05-07T20:32:05.8554384Z torch.manual_seed(2025) 2025-05-07T20:32:05.8554457Z 2025-05-07T20:32:05.8554625Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8554702Z 2025-05-07T20:32:05.8554794Z > x_sign = torch.sign(x) 2025-05-07T20:32:05.8556624Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8556635Z 2025-05-07T20:32:05.8556754Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:05.8556758Z 2025-05-07T20:32:05.8556861Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8557083Z self=, 2025-05-07T20:32:05.8557161Z T=16384, 2025-05-07T20:32:05.8557240Z D=5120, 2025-05-07T20:32:05.8557323Z scale_ub=None, 2025-05-07T20:32:05.8557407Z contiguous=True, 2025-05-07T20:32:05.8557497Z compiled=False, 2025-05-07T20:32:05.8557635Z ) 2025-05-07T20:32:05.8557851Z self = 2025-05-07T20:32:05.8558027Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:05.8558031Z 2025-05-07T20:32:05.8558108Z @given( 2025-05-07T20:32:05.8558225Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8558324Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8558479Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8558596Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8558708Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8558782Z ) 2025-05-07T20:32:05.8559024Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8559117Z def test_silu_mul_quant( 2025-05-07T20:32:05.8559192Z self, 2025-05-07T20:32:05.8559272Z T: int, 2025-05-07T20:32:05.8559351Z D: int, 2025-05-07T20:32:05.8559448Z scale_ub: Optional[float], 2025-05-07T20:32:05.8559540Z contiguous: bool, 2025-05-07T20:32:05.8559624Z compiled: bool, 2025-05-07T20:32:05.8559703Z ) -> None: 2025-05-07T20:32:05.8559801Z torch.manual_seed(2025) 2025-05-07T20:32:05.8559874Z 2025-05-07T20:32:05.8560041Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8561861Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8561869Z 2025-05-07T20:32:05.8561988Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8561993Z 2025-05-07T20:32:05.8562095Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8562315Z self=, 2025-05-07T20:32:05.8562395Z T=4096, 2025-05-07T20:32:05.8562471Z D=5120, 2025-05-07T20:32:05.8562553Z scale_ub=None, 2025-05-07T20:32:05.8562648Z contiguous=True, 2025-05-07T20:32:05.8562733Z compiled=False, 2025-05-07T20:32:05.8562808Z ) 2025-05-07T20:32:05.8563024Z self = 2025-05-07T20:32:05.8563193Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:05.8563197Z 2025-05-07T20:32:05.8563277Z @given( 2025-05-07T20:32:05.8563392Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8563531Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8563650Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8563764Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8563876Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8563951Z ) 2025-05-07T20:32:05.8564191Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8564289Z def test_silu_mul_quant( 2025-05-07T20:32:05.8564372Z self, 2025-05-07T20:32:05.8564447Z T: int, 2025-05-07T20:32:05.8564525Z D: int, 2025-05-07T20:32:05.8564621Z scale_ub: Optional[float], 2025-05-07T20:32:05.8564709Z contiguous: bool, 2025-05-07T20:32:05.8564795Z compiled: bool, 2025-05-07T20:32:05.8564871Z ) -> None: 2025-05-07T20:32:05.8564964Z torch.manual_seed(2025) 2025-05-07T20:32:05.8565040Z 2025-05-07T20:32:05.8565205Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8567019Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8567063Z 2025-05-07T20:32:05.8567183Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8567187Z 2025-05-07T20:32:05.8567289Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8567511Z self=, 2025-05-07T20:32:05.8567668Z T=2048, 2025-05-07T20:32:05.8567747Z D=5120, 2025-05-07T20:32:05.8567829Z scale_ub=None, 2025-05-07T20:32:05.8567921Z contiguous=False, 2025-05-07T20:32:05.8568007Z compiled=False, 2025-05-07T20:32:05.8568078Z ) 2025-05-07T20:32:05.8568293Z self = 2025-05-07T20:32:05.8568466Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:05.8568470Z 2025-05-07T20:32:05.8568548Z @given( 2025-05-07T20:32:05.8568664Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8568812Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8568924Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8569041Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8569153Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8569225Z ) 2025-05-07T20:32:05.8569470Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8569562Z def test_silu_mul_quant( 2025-05-07T20:32:05.8569639Z self, 2025-05-07T20:32:05.8569722Z T: int, 2025-05-07T20:32:05.8569797Z D: int, 2025-05-07T20:32:05.8569896Z scale_ub: Optional[float], 2025-05-07T20:32:05.8569990Z contiguous: bool, 2025-05-07T20:32:05.8570074Z compiled: bool, 2025-05-07T20:32:05.8570151Z ) -> None: 2025-05-07T20:32:05.8570248Z torch.manual_seed(2025) 2025-05-07T20:32:05.8570320Z 2025-05-07T20:32:05.8570487Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8572302Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8572308Z 2025-05-07T20:32:05.8572427Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8572431Z 2025-05-07T20:32:05.8572532Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8572752Z self=, 2025-05-07T20:32:05.8572835Z T=4096, 2025-05-07T20:32:05.8572917Z D=7168, 2025-05-07T20:32:05.8572999Z scale_ub=None, 2025-05-07T20:32:05.8573086Z contiguous=True, 2025-05-07T20:32:05.8573169Z compiled=True, 2025-05-07T20:32:05.8573242Z ) 2025-05-07T20:32:05.8573460Z self = 2025-05-07T20:32:05.8573626Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.8573631Z 2025-05-07T20:32:05.8573710Z @given( 2025-05-07T20:32:05.8573828Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8573966Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8574081Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8574196Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8574306Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8574382Z ) 2025-05-07T20:32:05.8574624Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8574763Z def test_silu_mul_quant( 2025-05-07T20:32:05.8574838Z self, 2025-05-07T20:32:05.8574913Z T: int, 2025-05-07T20:32:05.8574991Z D: int, 2025-05-07T20:32:05.8575088Z scale_ub: Optional[float], 2025-05-07T20:32:05.8575175Z contiguous: bool, 2025-05-07T20:32:05.8575262Z compiled: bool, 2025-05-07T20:32:05.8575339Z ) -> None: 2025-05-07T20:32:05.8575431Z torch.manual_seed(2025) 2025-05-07T20:32:05.8575508Z 2025-05-07T20:32:05.8575680Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8577452Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8577503Z 2025-05-07T20:32:05.8577622Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8577626Z 2025-05-07T20:32:05.8577727Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8577947Z self=, 2025-05-07T20:32:05.8578024Z T=2048, 2025-05-07T20:32:05.8578105Z D=5120, 2025-05-07T20:32:05.8578190Z scale_ub=1200.0, 2025-05-07T20:32:05.8578275Z contiguous=False, 2025-05-07T20:32:05.8578361Z compiled=False, 2025-05-07T20:32:05.8578435Z ) 2025-05-07T20:32:05.8578650Z self = 2025-05-07T20:32:05.8578827Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:05.8578832Z 2025-05-07T20:32:05.8578911Z @given( 2025-05-07T20:32:05.8579031Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8579129Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8579244Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8579358Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8579474Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8579546Z ) 2025-05-07T20:32:05.8579832Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8579933Z def test_silu_mul_quant( 2025-05-07T20:32:05.8580009Z self, 2025-05-07T20:32:05.8580085Z T: int, 2025-05-07T20:32:05.8580163Z D: int, 2025-05-07T20:32:05.8580259Z scale_ub: Optional[float], 2025-05-07T20:32:05.8580345Z contiguous: bool, 2025-05-07T20:32:05.8580432Z compiled: bool, 2025-05-07T20:32:05.8580510Z ) -> None: 2025-05-07T20:32:05.8580603Z torch.manual_seed(2025) 2025-05-07T20:32:05.8580683Z 2025-05-07T20:32:05.8580850Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8582662Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8582669Z 2025-05-07T20:32:05.8582786Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8582791Z 2025-05-07T20:32:05.8582894Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8583112Z self=, 2025-05-07T20:32:05.8583231Z T=4096, 2025-05-07T20:32:05.8583310Z D=7168, 2025-05-07T20:32:05.8583392Z scale_ub=1200.0, 2025-05-07T20:32:05.8583475Z contiguous=True, 2025-05-07T20:32:05.8583560Z compiled=False, 2025-05-07T20:32:05.8583633Z ) 2025-05-07T20:32:05.8583846Z self = 2025-05-07T20:32:05.8584017Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.8584022Z 2025-05-07T20:32:05.8584101Z @given( 2025-05-07T20:32:05.8584222Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8584318Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8584430Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8584545Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8584657Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8584730Z ) 2025-05-07T20:32:05.8584974Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8585112Z def test_silu_mul_quant( 2025-05-07T20:32:05.8585187Z self, 2025-05-07T20:32:05.8585265Z T: int, 2025-05-07T20:32:05.8585341Z D: int, 2025-05-07T20:32:05.8585439Z scale_ub: Optional[float], 2025-05-07T20:32:05.8585528Z contiguous: bool, 2025-05-07T20:32:05.8585613Z compiled: bool, 2025-05-07T20:32:05.8585692Z ) -> None: 2025-05-07T20:32:05.8585788Z torch.manual_seed(2025) 2025-05-07T20:32:05.8585860Z 2025-05-07T20:32:05.8586030Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8587800Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8587812Z 2025-05-07T20:32:05.8587934Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8587938Z 2025-05-07T20:32:05.8588038Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8588322Z self=, 2025-05-07T20:32:05.8588408Z T=16384, 2025-05-07T20:32:05.8588484Z D=7168, 2025-05-07T20:32:05.8588571Z scale_ub=None, 2025-05-07T20:32:05.8588657Z contiguous=False, 2025-05-07T20:32:05.8588740Z compiled=True, 2025-05-07T20:32:05.8588820Z ) 2025-05-07T20:32:05.8589032Z self = 2025-05-07T20:32:05.8589204Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:05.8589212Z 2025-05-07T20:32:05.8589293Z @given( 2025-05-07T20:32:05.8589410Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8589506Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8589620Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8589733Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8589847Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8589919Z ) 2025-05-07T20:32:05.8590202Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8590298Z def test_silu_mul_quant( 2025-05-07T20:32:05.8590373Z self, 2025-05-07T20:32:05.8590449Z T: int, 2025-05-07T20:32:05.8590527Z D: int, 2025-05-07T20:32:05.8590625Z scale_ub: Optional[float], 2025-05-07T20:32:05.8590713Z contiguous: bool, 2025-05-07T20:32:05.8590801Z compiled: bool, 2025-05-07T20:32:05.8590923Z ) -> None: 2025-05-07T20:32:05.8591016Z torch.manual_seed(2025) 2025-05-07T20:32:05.8591092Z 2025-05-07T20:32:05.8591257Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8593032Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8593038Z 2025-05-07T20:32:05.8593155Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8593160Z 2025-05-07T20:32:05.8593262Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8593524Z self=, 2025-05-07T20:32:05.8593603Z T=4096, 2025-05-07T20:32:05.8593683Z D=7168, 2025-05-07T20:32:05.8593765Z scale_ub=None, 2025-05-07T20:32:05.8593849Z contiguous=True, 2025-05-07T20:32:05.8593934Z compiled=False, 2025-05-07T20:32:05.8594006Z ) 2025-05-07T20:32:05.8594220Z self = 2025-05-07T20:32:05.8594394Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:05.8594401Z 2025-05-07T20:32:05.8594478Z @given( 2025-05-07T20:32:05.8594598Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8594694Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8594806Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8594925Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8595036Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8595116Z ) 2025-05-07T20:32:05.8595360Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8595452Z def test_silu_mul_quant( 2025-05-07T20:32:05.8595527Z self, 2025-05-07T20:32:05.8595610Z T: int, 2025-05-07T20:32:05.8595685Z D: int, 2025-05-07T20:32:05.8595786Z scale_ub: Optional[float], 2025-05-07T20:32:05.8595874Z contiguous: bool, 2025-05-07T20:32:05.8595959Z compiled: bool, 2025-05-07T20:32:05.8596084Z ) -> None: 2025-05-07T20:32:05.8596180Z torch.manual_seed(2025) 2025-05-07T20:32:05.8596252Z 2025-05-07T20:32:05.8596417Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8598189Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8598199Z 2025-05-07T20:32:05.8598318Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8598323Z 2025-05-07T20:32:05.8598424Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8598682Z self=, 2025-05-07T20:32:05.8598764Z T=16384, 2025-05-07T20:32:05.8598840Z D=7168, 2025-05-07T20:32:05.8598925Z scale_ub=None, 2025-05-07T20:32:05.8599011Z contiguous=True, 2025-05-07T20:32:05.8599094Z compiled=False, 2025-05-07T20:32:05.8599171Z ) 2025-05-07T20:32:05.8599385Z self = 2025-05-07T20:32:05.8599599Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:05.8599604Z 2025-05-07T20:32:05.8599682Z @given( 2025-05-07T20:32:05.8599797Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8599893Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8600007Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8600122Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8600237Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8600313Z ) 2025-05-07T20:32:05.8600552Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8600646Z def test_silu_mul_quant( 2025-05-07T20:32:05.8600721Z self, 2025-05-07T20:32:05.8600798Z T: int, 2025-05-07T20:32:05.8600876Z D: int, 2025-05-07T20:32:05.8600972Z scale_ub: Optional[float], 2025-05-07T20:32:05.8601061Z contiguous: bool, 2025-05-07T20:32:05.8601193Z compiled: bool, 2025-05-07T20:32:05.8601271Z ) -> None: 2025-05-07T20:32:05.8601364Z torch.manual_seed(2025) 2025-05-07T20:32:05.8601439Z 2025-05-07T20:32:05.8601603Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8603379Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8603385Z 2025-05-07T20:32:05.8603500Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8603510Z 2025-05-07T20:32:05.8603613Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8603831Z self=, 2025-05-07T20:32:05.8603908Z T=16384, 2025-05-07T20:32:05.8603988Z D=7168, 2025-05-07T20:32:05.8604071Z scale_ub=1200.0, 2025-05-07T20:32:05.8604156Z contiguous=True, 2025-05-07T20:32:05.8604242Z compiled=False, 2025-05-07T20:32:05.8604315Z ) 2025-05-07T20:32:05.8604567Z self = 2025-05-07T20:32:05.8604745Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.8604750Z 2025-05-07T20:32:05.8604829Z @given( 2025-05-07T20:32:05.8604947Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8605045Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8605159Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8605277Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8605393Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8605468Z ) 2025-05-07T20:32:05.8605939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8606076Z def test_silu_mul_quant( 2025-05-07T20:32:05.8606154Z self, 2025-05-07T20:32:05.8606234Z T: int, 2025-05-07T20:32:05.8606308Z D: int, 2025-05-07T20:32:05.8606405Z scale_ub: Optional[float], 2025-05-07T20:32:05.8606495Z contiguous: bool, 2025-05-07T20:32:05.8606661Z compiled: bool, 2025-05-07T20:32:05.8606742Z ) -> None: 2025-05-07T20:32:05.8606835Z torch.manual_seed(2025) 2025-05-07T20:32:05.8606924Z 2025-05-07T20:32:05.8607115Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8609021Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8609100Z 2025-05-07T20:32:05.8609218Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8609226Z 2025-05-07T20:32:05.8609328Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8609545Z self=, 2025-05-07T20:32:05.8609630Z T=128, 2025-05-07T20:32:05.8609702Z D=5120, 2025-05-07T20:32:05.8609788Z scale_ub=1200.0, 2025-05-07T20:32:05.8609872Z contiguous=False, 2025-05-07T20:32:05.8609951Z compiled=False, 2025-05-07T20:32:05.8610036Z ) 2025-05-07T20:32:05.8610317Z self = 2025-05-07T20:32:05.8610496Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:05.8610501Z 2025-05-07T20:32:05.8610578Z @given( 2025-05-07T20:32:05.8610693Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8610790Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8610906Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8611021Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8611139Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8611213Z ) 2025-05-07T20:32:05.8611453Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8611548Z def test_silu_mul_quant( 2025-05-07T20:32:05.8611620Z self, 2025-05-07T20:32:05.8611698Z T: int, 2025-05-07T20:32:05.8611772Z D: int, 2025-05-07T20:32:05.8611872Z scale_ub: Optional[float], 2025-05-07T20:32:05.8611966Z contiguous: bool, 2025-05-07T20:32:05.8612049Z compiled: bool, 2025-05-07T20:32:05.8612126Z ) -> None: 2025-05-07T20:32:05.8612222Z torch.manual_seed(2025) 2025-05-07T20:32:05.8612294Z 2025-05-07T20:32:05.8612459Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8612538Z 2025-05-07T20:32:05.8612630Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8612812Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8612908Z x = x_sign * x_clamp 2025-05-07T20:32:05.8612988Z x0 = x[:, :D] 2025-05-07T20:32:05.8613070Z x1 = x[:, D:] 2025-05-07T20:32:05.8613139Z 2025-05-07T20:32:05.8613220Z if contiguous: 2025-05-07T20:32:05.8613313Z x0 = x0.contiguous() 2025-05-07T20:32:05.8613401Z x1 = x1.contiguous() 2025-05-07T20:32:05.8613470Z 2025-05-07T20:32:05.8613562Z if scale_ub is not None: 2025-05-07T20:32:05.8613671Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8613805Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8613886Z ) 2025-05-07T20:32:05.8613963Z else: 2025-05-07T20:32:05.8614055Z scale_ub_tensor = None 2025-05-07T20:32:05.8614134Z 2025-05-07T20:32:05.8614264Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8614356Z op = silu_mul_quant 2025-05-07T20:32:05.8614441Z if compiled: 2025-05-07T20:32:05.8614583Z op = torch.compile(op) 2025-05-07T20:32:05.8614695Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8614765Z 2025-05-07T20:32:05.8614854Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8614859Z 2025-05-07T20:32:05.8614959Z moe/activation_test.py:117: 2025-05-07T20:32:05.8615088Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8615185Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8615335Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8615834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8615934Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8616290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8616510Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8616855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8616950Z kernel = self.compile( 2025-05-07T20:32:05.8617329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8617507Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8617699Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8617704Z 2025-05-07T20:32:05.8617912Z self = 2025-05-07T20:32:05.8618687Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8619198Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a89237e0>} 2025-05-07T20:32:05.8619949Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8620141Z context = 2025-05-07T20:32:05.8620150Z 2025-05-07T20:32:05.8620318Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8620579Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8620687Z module_map=module_map) 2025-05-07T20:32:05.8620847Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8620945Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8621066Z E ^ 2025-05-07T20:32:05.8621423Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8621428Z 2025-05-07T20:32:05.8621839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8621849Z 2025-05-07T20:32:05.8621952Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8622173Z self=, 2025-05-07T20:32:05.8622259Z T=2048, 2025-05-07T20:32:05.8622335Z D=7168, 2025-05-07T20:32:05.8622414Z scale_ub=None, 2025-05-07T20:32:05.8622503Z contiguous=False, 2025-05-07T20:32:05.8622586Z compiled=False, 2025-05-07T20:32:05.8622657Z ) 2025-05-07T20:32:05.8622875Z self = 2025-05-07T20:32:05.8623046Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:05.8623053Z 2025-05-07T20:32:05.8623166Z @given( 2025-05-07T20:32:05.8623293Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8623391Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8623506Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8623618Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8623728Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8623852Z ) 2025-05-07T20:32:05.8624094Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8624185Z def test_silu_mul_quant( 2025-05-07T20:32:05.8624260Z self, 2025-05-07T20:32:05.8624339Z T: int, 2025-05-07T20:32:05.8624416Z D: int, 2025-05-07T20:32:05.8624515Z scale_ub: Optional[float], 2025-05-07T20:32:05.8624601Z contiguous: bool, 2025-05-07T20:32:05.8624690Z compiled: bool, 2025-05-07T20:32:05.8624766Z ) -> None: 2025-05-07T20:32:05.8624863Z torch.manual_seed(2025) 2025-05-07T20:32:05.8624943Z 2025-05-07T20:32:05.8625110Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8626896Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8626963Z 2025-05-07T20:32:05.8627097Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8627102Z 2025-05-07T20:32:05.8627209Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8627435Z self=, 2025-05-07T20:32:05.8627511Z T=128, 2025-05-07T20:32:05.8627584Z D=7168, 2025-05-07T20:32:05.8627672Z scale_ub=1200.0, 2025-05-07T20:32:05.8627756Z contiguous=True, 2025-05-07T20:32:05.8627842Z compiled=True, 2025-05-07T20:32:05.8627916Z ) 2025-05-07T20:32:05.8628129Z self = 2025-05-07T20:32:05.8628299Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.8628308Z 2025-05-07T20:32:05.8628384Z @given( 2025-05-07T20:32:05.8628500Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8628603Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8628715Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8628826Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8628943Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8629059Z ) 2025-05-07T20:32:05.8629305Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8629396Z def test_silu_mul_quant( 2025-05-07T20:32:05.8629473Z self, 2025-05-07T20:32:05.8629555Z T: int, 2025-05-07T20:32:05.8629630Z D: int, 2025-05-07T20:32:05.8629728Z scale_ub: Optional[float], 2025-05-07T20:32:05.8629820Z contiguous: bool, 2025-05-07T20:32:05.8629904Z compiled: bool, 2025-05-07T20:32:05.8629986Z ) -> None: 2025-05-07T20:32:05.8630083Z torch.manual_seed(2025) 2025-05-07T20:32:05.8630155Z 2025-05-07T20:32:05.8630319Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8630397Z 2025-05-07T20:32:05.8630485Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8630612Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8630697Z x = x_sign * x_clamp 2025-05-07T20:32:05.8630778Z x0 = x[:, :D] 2025-05-07T20:32:05.8630904Z x1 = x[:, D:] 2025-05-07T20:32:05.8630977Z 2025-05-07T20:32:05.8631060Z if contiguous: 2025-05-07T20:32:05.8631158Z x0 = x0.contiguous() 2025-05-07T20:32:05.8631245Z x1 = x1.contiguous() 2025-05-07T20:32:05.8631316Z 2025-05-07T20:32:05.8631412Z if scale_ub is not None: 2025-05-07T20:32:05.8631516Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.8631649Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.8631771Z ) 2025-05-07T20:32:05.8631845Z else: 2025-05-07T20:32:05.8631937Z scale_ub_tensor = None 2025-05-07T20:32:05.8632015Z 2025-05-07T20:32:05.8632141Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.8632235Z op = silu_mul_quant 2025-05-07T20:32:05.8632317Z if compiled: 2025-05-07T20:32:05.8632414Z op = torch.compile(op) 2025-05-07T20:32:05.8632526Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8632603Z 2025-05-07T20:32:05.8632692Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.8632697Z 2025-05-07T20:32:05.8632798Z moe/activation_test.py:117: 2025-05-07T20:32:05.8632925Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8633022Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.8633124Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.8633490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.8633630Z return fn(*args, **kwargs) 
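[editor's note] Every OOM record above ends with the same mitigation hint. A minimal sketch of applying it, assuming the test process controls its own environment; `PYTORCH_CUDA_ALLOC_CONF` and `expandable_segments:True` are taken from the log, the rest is illustrative:

```python
import os

# The allocator reads PYTORCH_CUDA_ALLOC_CONF once, at first CUDA use, so it
# must be set before torch initializes CUDA (easiest: before importing torch).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported after the env var so the caching allocator sees it

if torch.cuda.is_available():
    # Same shape as the failing examples above (T=16384, D=7168, bfloat16).
    x = torch.randn([16384, 2 * 7168], device="cuda", dtype=torch.bfloat16)
```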
2025-05-07T20:32:05.8634121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.8634218Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.8634579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.8634803Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.8635140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.8635236Z kernel = self.compile( 2025-05-07T20:32:05.8635614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.8635792Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8635921Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.8635925Z 2025-05-07T20:32:05.8636129Z self = 2025-05-07T20:32:05.8637021Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.8637525Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f39a8786a20>} 2025-05-07T20:32:05.8638279Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.8638474Z context = 2025-05-07T20:32:05.8638478Z 2025-05-07T20:32:05.8638645Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.8638905Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8639010Z module_map=module_map) 2025-05-07T20:32:05.8639177Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8639278Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8639394Z E ^ 2025-05-07T20:32:05.8639752Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.8639757Z 2025-05-07T20:32:05.8640167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8640171Z 2025-05-07T20:32:05.8640278Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8640538Z self=, 2025-05-07T20:32:05.8640616Z T=128, 2025-05-07T20:32:05.8640695Z D=7168, 2025-05-07T20:32:05.8640776Z scale_ub=1200.0, 2025-05-07T20:32:05.8640858Z contiguous=True, 2025-05-07T20:32:05.8640946Z compiled=False, 2025-05-07T20:32:05.8641017Z ) 2025-05-07T20:32:05.8641232Z self = 2025-05-07T20:32:05.8641409Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.8641414Z 2025-05-07T20:32:05.8641490Z @given( 2025-05-07T20:32:05.8641607Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8641707Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8641820Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8641940Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8642051Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8642170Z ) 2025-05-07T20:32:05.8642415Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8642507Z def test_silu_mul_quant( 2025-05-07T20:32:05.8642592Z self, 2025-05-07T20:32:05.8642666Z T: int, 2025-05-07T20:32:05.8642744Z D: int, 2025-05-07T20:32:05.8642844Z scale_ub: Optional[float], 2025-05-07T20:32:05.8642933Z contiguous: bool, 2025-05-07T20:32:05.8643020Z compiled: bool, 2025-05-07T20:32:05.8643106Z ) -> None: 2025-05-07T20:32:05.8643198Z torch.manual_seed(2025) 2025-05-07T20:32:05.8643270Z 2025-05-07T20:32:05.8643437Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8643510Z 2025-05-07T20:32:05.8643598Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8643724Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8645531Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8645542Z 2025-05-07T20:32:05.8645669Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:05.8645673Z 2025-05-07T20:32:05.8645776Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8646003Z self=, 2025-05-07T20:32:05.8646080Z T=128, 2025-05-07T20:32:05.8646158Z D=5120, 2025-05-07T20:32:05.8646244Z scale_ub=1200.0, 2025-05-07T20:32:05.8646328Z contiguous=True, 2025-05-07T20:32:05.8646411Z compiled=True, 2025-05-07T20:32:05.8646485Z ) 2025-05-07T20:32:05.8646703Z self = 2025-05-07T20:32:05.8646885Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.8646890Z 2025-05-07T20:32:05.8646977Z @given( 2025-05-07T20:32:05.8647117Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8647218Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8647375Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8647491Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8647660Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8647729Z ) 2025-05-07T20:32:05.8647969Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8648066Z def test_silu_mul_quant( 2025-05-07T20:32:05.8648140Z self, 2025-05-07T20:32:05.8648286Z T: int, 2025-05-07T20:32:05.8648363Z D: int, 2025-05-07T20:32:05.8648460Z scale_ub: Optional[float], 2025-05-07T20:32:05.8648547Z contiguous: bool, 2025-05-07T20:32:05.8648627Z compiled: bool, 2025-05-07T20:32:05.8648703Z ) -> None: 2025-05-07T20:32:05.8648804Z torch.manual_seed(2025) 2025-05-07T20:32:05.8648878Z 2025-05-07T20:32:05.8649043Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8649123Z 2025-05-07T20:32:05.8649221Z x_sign = torch.sign(x) 2025-05-07T20:32:05.8649346Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.8651109Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8651160Z 2025-05-07T20:32:05.8651277Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:05.8651281Z 2025-05-07T20:32:05.8651389Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.8651608Z self=, 2025-05-07T20:32:05.8651693Z T=128, 2025-05-07T20:32:05.8651769Z D=7168, 2025-05-07T20:32:05.8651850Z scale_ub=None, 2025-05-07T20:32:05.8651937Z contiguous=True, 2025-05-07T20:32:05.8652016Z compiled=True, 2025-05-07T20:32:05.8652084Z ) 2025-05-07T20:32:05.8652303Z self = 2025-05-07T20:32:05.8652466Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.8652476Z 2025-05-07T20:32:05.8652552Z @given( 2025-05-07T20:32:05.8652670Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.8652768Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.8652880Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.8653000Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.8653109Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.8653188Z ) 2025-05-07T20:32:05.8653474Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.8653567Z def test_silu_mul_quant( 2025-05-07T20:32:05.8653645Z self, 2025-05-07T20:32:05.8653719Z T: int, 2025-05-07T20:32:05.8653791Z D: int, 2025-05-07T20:32:05.8653890Z scale_ub: Optional[float], 2025-05-07T20:32:05.8653977Z contiguous: bool, 2025-05-07T20:32:05.8654061Z compiled: bool, 2025-05-07T20:32:05.8654140Z ) -> None: 2025-05-07T20:32:05.8654238Z torch.manual_seed(2025) 2025-05-07T20:32:05.8654310Z 2025-05-07T20:32:05.8654479Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.8656285Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.8656295Z 2025-05-07T20:32:05.8656410Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.8656543Z =============================== warnings summary =============================== 2025-05-07T20:32:05.8656890Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:05.8657187Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:05.8657481Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:05.8658362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:05.8658593Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:05.8658597Z 2025-05-07T20:32:05.8658816Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:05.8658983Z ================= 1 failed, 1 deselected, 3 warnings in 16.33s ================= 2025-05-07T20:32:07.4783557Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:07.5403056Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:07.5403348Z 2025-05-07T20:32:09.5420030Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:11.6823013Z ============================= test session starts ============================== 2025-05-07T20:32:11.6823686Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:11.6824209Z cachedir: .pytest_cache 2025-05-07T20:32:11.6824778Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:11.6825539Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:11.6825950Z plugins: hypothesis-6.131.14 2025-05-07T20:32:13.2679182Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:13.4196153Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:13.4197274Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:13.4197848Z 2025-05-07T20:32:15.7672315Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.7673595Z self=, 2025-05-07T20:32:15.7674418Z T=1, 2025-05-07T20:32:15.7674798Z D=5120, 2025-05-07T20:32:15.7675179Z scale_ub=None, 2025-05-07T20:32:15.7675612Z contiguous=True, 2025-05-07T20:32:15.7676061Z compiled=True, 2025-05-07T20:32:15.7676464Z ) 2025-05-07T20:32:15.7677127Z self = 2025-05-07T20:32:15.7678102Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:15.7678619Z 2025-05-07T20:32:15.7678794Z @given( 2025-05-07T20:32:15.7679252Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.7679752Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.7680067Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.7680401Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.7680825Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.7681122Z ) 2025-05-07T20:32:15.7681472Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.7681927Z def test_silu_mul_quant( 2025-05-07T20:32:15.7682176Z self, 2025-05-07T20:32:15.7682372Z T: int, 2025-05-07T20:32:15.7682577Z D: int, 2025-05-07T20:32:15.7682805Z scale_ub: Optional[float], 2025-05-07T20:32:15.7683166Z contiguous: bool, 2025-05-07T20:32:15.7683410Z compiled: bool, 2025-05-07T20:32:15.7683649Z ) -> None: 2025-05-07T20:32:15.7683876Z torch.manual_seed(2025) 2025-05-07T20:32:15.7684119Z 2025-05-07T20:32:15.7684420Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.7684765Z 2025-05-07T20:32:15.7684969Z x_sign = torch.sign(x) 2025-05-07T20:32:15.7685264Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:15.7685584Z x = x_sign * x_clamp 2025-05-07T20:32:15.7685834Z x0 = x[:, :D] 2025-05-07T20:32:15.7686047Z x1 = x[:, D:] 2025-05-07T20:32:15.7686269Z 2025-05-07T20:32:15.7686463Z if contiguous: 2025-05-07T20:32:15.7686699Z x0 = x0.contiguous() 2025-05-07T20:32:15.7686966Z x1 = x1.contiguous() 2025-05-07T20:32:15.7687212Z 2025-05-07T20:32:15.7687405Z if scale_ub is not None: 2025-05-07T20:32:15.7687901Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.7688245Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.7688556Z ) 2025-05-07T20:32:15.7688761Z else: 2025-05-07T20:32:15.7688981Z scale_ub_tensor = None 2025-05-07T20:32:15.7689247Z 2025-05-07T20:32:15.7689485Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.7689814Z op = silu_mul_quant 2025-05-07T20:32:15.7690075Z if compiled: 2025-05-07T20:32:15.7690329Z op = torch.compile(op) 2025-05-07T20:32:15.7690633Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.7690919Z 2025-05-07T20:32:15.7691117Z y_fp8, y_scale = fn() 2025-05-07T20:32:15.7691411Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:15.7691712Z 2025-05-07T20:32:15.7691954Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.7692299Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:15.7692610Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:15.7692924Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:15.7693290Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:15.7693617Z 2025-05-07T20:32:15.7693832Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:15.7694031Z 2025-05-07T20:32:15.7694135Z moe/activation_test.py:126: 2025-05-07T20:32:15.7694496Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.7694845Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:15.7695175Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:15.7695972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:15.7696736Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:15.7697290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.7697980Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.7698669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:15.7699389Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:15.7700236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:15.7700988Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:15.7701720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:15.7702356Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:15.7703007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:15.7703526Z fn() 2025-05-07T20:32:15.7704030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:15.7704617Z self.fn.run( 
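[editor's note] The traceback above passes through `triton/runtime/autotuner.py` (`run` -> `_bench` -> `do_bench`), the same module that emitted the earlier DeprecationWarning about `warmup`, `rep`, and `use_cuda_graph`. A hedged sketch of an autotuned Triton kernel declared without those deprecated parameters, per the linked triton-lang/triton#4496; the kernel itself is a toy copy kernel, not FBGEMM code:

```python
import triton
import triton.language as tl

@triton.autotune(
    # Benchmarking knobs now live in do_bench, so only configs/key are passed.
    configs=[triton.Config({"BLOCK": 128}), triton.Config({"BLOCK": 256})],
    key=["n_elements"],
)
@triton.jit
def _copy_kernel(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    tl.store(y_ptr + offs, tl.load(x_ptr + offs, mask=mask), mask=mask)
```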
2025-05-07T20:32:15.7705087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.7705953Z kernel = self.compile( 2025-05-07T20:32:15.7706524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.7707179Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.7707575Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.7707804Z 2025-05-07T20:32:15.7708009Z self = 2025-05-07T20:32:15.7709190Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.7710583Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05c48553a0>} 2025-05-07T20:32:15.7711929Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.7712962Z context = 2025-05-07T20:32:15.7713248Z 2025-05-07T20:32:15.7713415Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.7713942Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.7714417Z module_map=module_map) 2025-05-07T20:32:15.7714791Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.7715148Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:15.7715420Z E ^ 2025-05-07T20:32:15.7715888Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.7716337Z 2025-05-07T20:32:15.7716824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.7717347Z 2025-05-07T20:32:15.7717454Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.7717871Z self=, 2025-05-07T20:32:15.7718277Z T=2048, 2025-05-07T20:32:15.7718467Z D=5120, 2025-05-07T20:32:15.7718666Z scale_ub=1200.0, 2025-05-07T20:32:15.7718897Z contiguous=True, 2025-05-07T20:32:15.7719127Z compiled=False, 2025-05-07T20:32:15.7719337Z ) 2025-05-07T20:32:16.6881882Z self = 2025-05-07T20:32:16.6882494Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.6882880Z 2025-05-07T20:32:16.6882987Z @given( 2025-05-07T20:32:16.6883225Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6883542Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6884169Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6884512Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6884848Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6885148Z ) 2025-05-07T20:32:16.6885507Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6885950Z def test_silu_mul_quant( 2025-05-07T20:32:16.6886199Z self, 2025-05-07T20:32:16.6886501Z T: int, 2025-05-07T20:32:16.6886703Z D: int, 2025-05-07T20:32:16.6886925Z scale_ub: Optional[float], 2025-05-07T20:32:16.6887206Z contiguous: bool, 2025-05-07T20:32:16.6887446Z compiled: bool, 2025-05-07T20:32:16.6887810Z ) -> None: 2025-05-07T20:32:16.6888042Z torch.manual_seed(2025) 2025-05-07T20:32:16.6888292Z 2025-05-07T20:32:16.6888579Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6888939Z 
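[editor's note] The recurring `ValueError("type fp8e4nv not supported in this architecture...")` is consistent with the runner hardware: linux.g5.4xlarge carries an A10G (compute capability 8.6), while Triton's fp8e4nv (float8_e4m3fn) lowering is assumed here to require capability 8.9+ (Ada/Hopper). A sketch of a guard that would skip these tests instead of failing inside the Triton compiler; the helper names are hypothetical:

```python
import unittest
import torch

def _has_fp8e4nv() -> bool:
    # Assumption: fp8e4nv needs SM 8.9 or newer; A10G (8.6) fails as above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

skip_unless_fp8 = unittest.skipUnless(
    _has_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9 (Ada/Hopper)"
)
```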
2025-05-07T20:32:16.6889148Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6889453Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6889777Z x = x_sign * x_clamp 2025-05-07T20:32:16.6890025Z x0 = x[:, :D] 2025-05-07T20:32:16.6890258Z x1 = x[:, D:] 2025-05-07T20:32:16.6890482Z 2025-05-07T20:32:16.6890685Z if contiguous: 2025-05-07T20:32:16.6890927Z x0 = x0.contiguous() 2025-05-07T20:32:16.6891207Z x1 = x1.contiguous() 2025-05-07T20:32:16.6891566Z 2025-05-07T20:32:16.6891762Z if scale_ub is not None: 2025-05-07T20:32:16.6892048Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6892398Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6892709Z ) 2025-05-07T20:32:16.6892912Z else: 2025-05-07T20:32:16.6893133Z scale_ub_tensor = None 2025-05-07T20:32:16.6893389Z 2025-05-07T20:32:16.6893636Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6893962Z op = silu_mul_quant 2025-05-07T20:32:16.6894220Z if compiled: 2025-05-07T20:32:16.6894485Z op = torch.compile(op) 2025-05-07T20:32:16.6894796Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6895076Z 2025-05-07T20:32:16.6895294Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6895469Z 2025-05-07T20:32:16.6895574Z moe/activation_test.py:117: 2025-05-07T20:32:16.6895887Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6896243Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6896531Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6897236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6897947Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6898569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6899270Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6899942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6900487Z kernel = self.compile( 2025-05-07T20:32:16.6901036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6901708Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6902117Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6902348Z 2025-05-07T20:32:16.6902567Z self = 2025-05-07T20:32:16.6903699Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6905100Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f05c45102c0>} 2025-05-07T20:32:16.6906835Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6907950Z context = 2025-05-07T20:32:16.6908237Z 2025-05-07T20:32:16.6908405Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6908927Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6909395Z module_map=module_map) 2025-05-07T20:32:16.6909775Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6910174Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6910438Z E ^ 2025-05-07T20:32:16.6910903Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6911350Z 2025-05-07T20:32:16.6911766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6912400Z 2025-05-07T20:32:16.6912506Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6912925Z self=, 2025-05-07T20:32:16.6913333Z T=2048, 2025-05-07T20:32:16.6913527Z D=5120, 2025-05-07T20:32:16.6913735Z scale_ub=1200.0, 2025-05-07T20:32:16.6913973Z contiguous=True, 2025-05-07T20:32:16.6914202Z compiled=True, 2025-05-07T20:32:16.6914422Z ) 2025-05-07T20:32:16.6914754Z self = 2025-05-07T20:32:16.6915258Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.6915541Z 2025-05-07T20:32:16.6915624Z @given( 2025-05-07T20:32:16.6915871Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6916198Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6916510Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6916855Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6917199Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6917490Z ) 2025-05-07T20:32:16.6917853Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6918303Z def test_silu_mul_quant( 2025-05-07T20:32:16.6918551Z self, 2025-05-07T20:32:16.6918760Z T: int, 2025-05-07T20:32:16.6918973Z D: int, 2025-05-07T20:32:16.6919198Z scale_ub: Optional[float], 2025-05-07T20:32:16.6919553Z contiguous: bool, 2025-05-07T20:32:16.6919814Z compiled: bool, 2025-05-07T20:32:16.6920043Z ) -> None: 2025-05-07T20:32:16.6920274Z torch.manual_seed(2025) 2025-05-07T20:32:16.6920529Z 2025-05-07T20:32:16.6920817Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6921165Z 2025-05-07T20:32:16.6921376Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6921684Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6922003Z x = x_sign * x_clamp 2025-05-07T20:32:16.6922258Z x0 = x[:, :D] 2025-05-07T20:32:16.6922491Z x1 = x[:, D:] 2025-05-07T20:32:16.6922704Z 2025-05-07T20:32:16.6922906Z if contiguous: 2025-05-07T20:32:16.6923151Z x0 = x0.contiguous() 2025-05-07T20:32:16.6923420Z x1 = x1.contiguous() 2025-05-07T20:32:16.6923670Z 2025-05-07T20:32:16.6923878Z if scale_ub is not None: 2025-05-07T20:32:16.6924155Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6924577Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6924901Z ) 2025-05-07T20:32:16.6925102Z else: 2025-05-07T20:32:16.6925325Z scale_ub_tensor = None 2025-05-07T20:32:16.6925590Z 2025-05-07T20:32:16.6925834Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6926158Z op = silu_mul_quant 2025-05-07T20:32:16.6926424Z if compiled: 
2025-05-07T20:32:16.6926733Z op = torch.compile(op) 2025-05-07T20:32:16.6927037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6927325Z 2025-05-07T20:32:16.6927613Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.6927895Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.6928187Z 2025-05-07T20:32:16.6928426Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6928752Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.6929048Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.6929366Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.6929717Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.6930027Z 2025-05-07T20:32:16.6930226Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:16.6930419Z 2025-05-07T20:32:16.6930522Z moe/activation_test.py:126: 2025-05-07T20:32:16.6930813Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6931201Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.6931528Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.6932310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.6933069Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.6933623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6934316Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6935002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.6935728Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.6936487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:16.6937244Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.6937973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.6938629Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.6939287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.6939811Z fn() 2025-05-07T20:32:16.6940331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.6940922Z self.fn.run( 2025-05-07T20:32:16.6941399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6941934Z kernel = self.compile( 2025-05-07T20:32:16.6942490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6943153Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6943553Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6943791Z 2025-05-07T20:32:16.6944003Z self = 2025-05-07T20:32:16.6945135Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:16.6946509Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05c4510900>} 2025-05-07T20:32:16.6947852Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6948917Z context = 2025-05-07T20:32:16.6949212Z 2025-05-07T20:32:16.6949378Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6949916Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6950441Z module_map=module_map) 2025-05-07T20:32:16.6950805Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6951166Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.6951439Z E ^ 2025-05-07T20:32:16.6951902Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6952357Z 2025-05-07T20:32:16.6952772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6953339Z 2025-05-07T20:32:16.6953447Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6953866Z self=, 2025-05-07T20:32:16.6954270Z T=16384, 2025-05-07T20:32:16.6954473Z D=7168, 2025-05-07T20:32:16.6954678Z scale_ub=1200.0, 2025-05-07T20:32:16.6954904Z contiguous=False, 2025-05-07T20:32:16.6955144Z compiled=False, 2025-05-07T20:32:16.6955357Z ) 2025-05-07T20:32:17.4852970Z self = 2025-05-07T20:32:17.4853743Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.4854138Z 2025-05-07T20:32:17.4854259Z @given( 2025-05-07T20:32:17.4854574Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4854982Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.4855381Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.4855706Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.4856043Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.4856332Z ) 2025-05-07T20:32:17.4856682Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.4857125Z def test_silu_mul_quant( 2025-05-07T20:32:17.4857371Z self, 2025-05-07T20:32:17.4857564Z T: int, 2025-05-07T20:32:17.4858075Z D: int, 2025-05-07T20:32:17.4858304Z scale_ub: Optional[float], 2025-05-07T20:32:17.4858572Z contiguous: bool, 2025-05-07T20:32:17.4858815Z compiled: bool, 2025-05-07T20:32:17.4859046Z ) -> None: 2025-05-07T20:32:17.4859268Z torch.manual_seed(2025) 2025-05-07T20:32:17.4859504Z 2025-05-07T20:32:17.4859780Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4860129Z 2025-05-07T20:32:17.4860359Z x_sign = torch.sign(x) 2025-05-07T20:32:17.4860642Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.4860954Z x = x_sign * x_clamp 2025-05-07T20:32:17.4861201Z x0 = x[:, :D] 2025-05-07T20:32:17.4861416Z x1 = x[:, D:] 2025-05-07T20:32:17.4861621Z 2025-05-07T20:32:17.4861809Z if contiguous: 2025-05-07T20:32:17.4862044Z x0 = x0.contiguous() 2025-05-07T20:32:17.4862296Z x1 = x1.contiguous() 2025-05-07T20:32:17.4862536Z 2025-05-07T20:32:17.4862814Z if scale_ub is not None: 2025-05-07T20:32:17.4863085Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.4863423Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.4863735Z ) 2025-05-07T20:32:17.4863922Z else: 2025-05-07T20:32:17.4864133Z scale_ub_tensor = None 2025-05-07T20:32:17.4864670Z 2025-05-07T20:32:17.4864899Z def fn() -> Tuple[torch.Tensor, 
torch.Tensor]: 2025-05-07T20:32:17.4865303Z op = silu_mul_quant 2025-05-07T20:32:17.4865550Z if compiled: 2025-05-07T20:32:17.4865790Z op = torch.compile(op) 2025-05-07T20:32:17.4866086Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4866359Z 2025-05-07T20:32:17.4866551Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.4866718Z 2025-05-07T20:32:17.4866818Z moe/activation_test.py:117: 2025-05-07T20:32:17.4867115Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4867477Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.4867754Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4868454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.4869153Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.4869691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.4870463Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4871130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.4871661Z kernel = self.compile( 2025-05-07T20:32:17.4879373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.4880059Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4880471Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4880703Z 2025-05-07T20:32:17.4880914Z self = 2025-05-07T20:32:17.4882000Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4883391Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05bf51bc40>} 2025-05-07T20:32:17.4884730Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4885825Z context = 2025-05-07T20:32:17.4886126Z 2025-05-07T20:32:17.4886294Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4886818Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4887289Z module_map=module_map) 2025-05-07T20:32:17.4887748Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4888115Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.4888386Z E ^ 2025-05-07T20:32:17.4888851Z E ValueError("type fp8e4nv not supported in this architecture. 
[log condensed: Hypothesis then tried ten more examples. For each one it re-printed the test_silu_mul_quant source listing shown above and an essentially identical traceback (differing only in timestamps, object addresses, and, on the eager reference path, extra triton/runtime/autotuner.py frames), always ending in the same error:
    ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Examples with compiled=False failed directly at y_fp8, y_scale = fn() (moe/activation_test.py:117) while compiling _fbgemm_silu_mul_quant; examples with compiled=True got past fn() and instead failed at y_fp8_ref, y_scale_ref = ref_fn() (moe/activation_test.py:126) while compiling _kernel_quantize_fp8_row via triton_quantize_fp8_row (fp8_gemm.py:2370). The tried examples were:
    T=1,    D=7168, scale_ub=None,   contiguous=True,  compiled=True   -> ref_fn(): _kernel_quantize_fp8_row
    T=4096, D=5120, scale_ub=None,   contiguous=False, compiled=False  -> fn(): _fbgemm_silu_mul_quant
    T=4096, D=7168, scale_ub=None,   contiguous=False, compiled=False  -> fn(): _fbgemm_silu_mul_quant
    T=128,  D=7168, scale_ub=None,   contiguous=False, compiled=True   -> ref_fn(): _kernel_quantize_fp8_row
    T=128,  D=7168, scale_ub=None,   contiguous=False, compiled=False  -> fn(): _fbgemm_silu_mul_quant
    T=4096, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False  -> fn(): _fbgemm_silu_mul_quant
    T=1,    D=5120, scale_ub=None,   contiguous=True,  compiled=True   -> ref_fn(): _kernel_quantize_fp8_row
    T=2048, D=5120, scale_ub=None,   contiguous=True,  compiled=True   -> ref_fn(): _kernel_quantize_fp8_row
    T=128,  D=5120, scale_ub=None,   contiguous=True,  compiled=True   -> ref_fn(): _kernel_quantize_fp8_row
    T=4096, D=5120, scale_ub=None,   contiguous=True,  compiled=True   -> ref_fn(): _kernel_quantize_fp8_row]
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:20.7423293Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05beb02840>} 2025-05-07T20:32:20.7424639Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:20.7425676Z context = 2025-05-07T20:32:20.7425965Z 2025-05-07T20:32:20.7426138Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:20.7426657Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:20.7427130Z module_map=module_map) 2025-05-07T20:32:20.7427499Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:20.7427937Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:20.7428205Z E ^ 2025-05-07T20:32:20.7428677Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:20.7429127Z 2025-05-07T20:32:20.7429547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:20.7430057Z 2025-05-07T20:32:20.7430166Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:20.7430587Z self=, 2025-05-07T20:32:20.7430998Z T=16384, 2025-05-07T20:32:20.7431229Z D=5120, 2025-05-07T20:32:20.7431449Z scale_ub=None, 2025-05-07T20:32:20.7431673Z contiguous=True, 2025-05-07T20:32:20.7431904Z compiled=True, 2025-05-07T20:32:20.7432113Z ) 2025-05-07T20:32:20.7691615Z W0507 20:32:20.768000 87964 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:20.7693290Z W0507 20:32:20.768000 87964 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:20.7694653Z W0507 20:32:20.768000 87964 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:20.7695877Z W0507 20:32:20.768000 87964 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:20.7696989Z W0507 20:32:20.768000 87964 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 
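Note on the W0507 warnings above: they are a side effect of the parameter sweep, not of the compilation failures that follow. Hypothesis alternates contiguous between True and False, so x0 switches between a compacted copy (row stride 5120) and a strided view of x (row stride 10240); each stride change invalidates torch._dynamo's guards on silu_mul_quant until the default recompile limit of 8 is hit. A minimal sketch of the knobs the warning itself points to, assuming this PyTorch build exposes the config.recompile_limit attribute the warning names:

    import torch._dynamo

    # The warning reports config.recompile_limit (8); raising it lets
    # torch.compile keep specializing on new input strides instead of
    # giving up on the frame after eight recompiles.
    torch._dynamo.config.recompile_limit = 32

Alternatively, running with TORCH_LOGS="recompiles" (as the warning suggests) logs every recompilation reason without changing behavior.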
[same test body and traceback as the T=4096 example above: ref_fn() at moe/activation_test.py:126 -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid]]
2025-05-07T20:32:20.8423286Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:20.8423655Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:20.8423925Z E   ^
2025-05-07T20:32:20.8424396Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.8425971Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
[same test body; fails earlier, at fn() (moe/activation_test.py:117) -> torch.compile(silu_mul_quant) -> _fbgemm_silu_mul_quant[grid] (activation.py:80)]
2025-05-07T20:32:21.1172430Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:21.1172793Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:21.1173069Z E   ^
2025-05-07T20:32:21.1173554Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:21.1175108Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
[same test body; fails at ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]]
2025-05-07T20:32:21.1681889Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:21.1682252Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:21.1682520Z E   ^
2025-05-07T20:32:21.1682990Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:21.1684540Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
[same test body; fails at fn() -> silu_mul_quant, uncompiled -> _fbgemm_silu_mul_quant[grid]]
2025-05-07T20:32:21.2866310Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:21.2866667Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:21.2866927Z E   ^
2025-05-07T20:32:21.2867396Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
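Both failure sites attempt the same work: the path under test (silu_mul_quant) fuses SiLU, elementwise multiply, and row-wise FP8 quantization into one Triton kernel (_fbgemm_silu_mul_quant), while the reference path recomputes the activation in fp32 and quantizes it with triton_quantize_fp8_row (_kernel_quantize_fp8_row). For orientation, the row-wise quantization both kernels perform amounts to roughly the following; this is a rough pure-PyTorch sketch, not FBGEMM's implementation, with 448.0 being the largest finite fp8 e4m3 value:

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row scale chosen so each row's max magnitude maps onto the
        # largest finite fp8e4m3 value (448); scale_ub optionally caps it.
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / 448.0
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        # Dequantize as the test does: y_fp8.to(torch.float32) * scale[:, None]
        return y_fp8, scale

Whether the returned scale is the multiplier or its reciprocal is a convention; the test's check y_fp8.to(torch.float32) * y_scale[:, None] fixes it as the multiplier here.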
2025-05-07T20:32:21.2868945Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
[same test body; fails at fn() -> torch.compile(silu_mul_quant) -> _fbgemm_silu_mul_quant[grid]]
2025-05-07T20:32:21.2898567Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:21.2898936Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:21.2899198Z E   ^
2025-05-07T20:32:21.2899665Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:21.2901176Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[same test body; fails at fn() -> silu_mul_quant, uncompiled -> _fbgemm_silu_mul_quant[grid]]
2025-05-07T20:32:21.3797214Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:21.3797612Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:21.3797883Z E   ^
2025-05-07T20:32:21.3798351Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
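Every remaining example fails identically; the inputs never matter because the error fires at kernel compile time, before any data is touched. Any Triton kernel that materializes an fp8e4nv value reproduces it. A minimal standalone repro sketch (hypothetical, not part of the test suite; assumes triton and a CUDA device are available):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr):
        # Storing a loaded fp32 value as fp8e4nv forces the compiler to
        # lower an fp8e4nv conversion, the step that pre-SM-8.9 GPUs reject.
        x = tl.load(x_ptr)
        tl.store(y_ptr, x.to(tl.float8e4nv))

    x = torch.ones(1, device="cuda", dtype=torch.float32)
    y = torch.empty(1, device="cuda", dtype=torch.float8_e4m3fn)
    # On SM 8.9+ this stores 1.0 as fp8; on older GPUs (e.g. A10G, SM 8.6)
    # it raises the same CompilationError seen throughout this log.
    _cast_to_fp8e4nv[(1,)](x, y)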
2025-05-07T20:32:21.3799839Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
[same test body; fails at fn() -> silu_mul_quant, uncompiled -> _fbgemm_silu_mul_quant[grid]]
2025-05-07T20:32:21.3837921Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:21.3838284Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:21.3838550Z E   ^
2025-05-07T20:32:21.3839078Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:21.3840589Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[same test body; fails at fn() -> silu_mul_quant, uncompiled -> _fbgemm_silu_mul_quant[grid]]
2025-05-07T20:32:21.6792279Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:21.6792638Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:21.6792899Z E   ^
2025-05-07T20:32:21.6793369Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:21.6794907Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[same test body; fails at fn() -> torch.compile(silu_mul_quant) -> _fbgemm_silu_mul_quant[grid]]
2025-05-07T20:32:21.6824884Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:21.6825229Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:21.6825493Z E   ^
2025-05-07T20:32:21.6825954Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:21.6827481Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body; fails at fn() -> torch.compile(silu_mul_quant) -> _fbgemm_silu_mul_quant[grid]]
2025-05-07T20:32:21.7859892Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:21.7860245Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:21.7860504Z E   ^
2025-05-07T20:32:21.7860972Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.7861424Z 2025-05-07T20:32:21.7861847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.7862357Z 2025-05-07T20:32:21.7862463Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.7862878Z self=, 2025-05-07T20:32:21.7863282Z T=1, 2025-05-07T20:32:21.7863471Z D=7168, 2025-05-07T20:32:21.7863667Z scale_ub=None, 2025-05-07T20:32:21.7863890Z contiguous=False, 2025-05-07T20:32:21.7864169Z compiled=True, 2025-05-07T20:32:21.7864382Z ) 2025-05-07T20:32:21.8538896Z self = 2025-05-07T20:32:21.8539526Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:21.8539789Z 2025-05-07T20:32:21.8539878Z @given( 2025-05-07T20:32:21.8540100Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.8540412Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.8540743Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.8541071Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.8541390Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.8541675Z ) 2025-05-07T20:32:21.8542020Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.8542453Z def test_silu_mul_quant( 2025-05-07T20:32:21.8542701Z self, 2025-05-07T20:32:21.8542897Z T: int, 2025-05-07T20:32:21.8543320Z D: int, 2025-05-07T20:32:21.8543541Z scale_ub: Optional[float], 2025-05-07T20:32:21.8543812Z contiguous: bool, 2025-05-07T20:32:21.8544045Z compiled: bool, 2025-05-07T20:32:21.8544273Z ) -> None: 2025-05-07T20:32:21.8544488Z torch.manual_seed(2025) 2025-05-07T20:32:21.8544726Z 2025-05-07T20:32:21.8544998Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.8545427Z 2025-05-07T20:32:21.8545616Z x_sign = torch.sign(x) 2025-05-07T20:32:21.8545907Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.8546221Z x = x_sign * x_clamp 2025-05-07T20:32:21.8546462Z x0 = x[:, :D] 2025-05-07T20:32:21.8546671Z x1 = x[:, D:] 2025-05-07T20:32:21.8546886Z 2025-05-07T20:32:21.8547070Z if contiguous: 2025-05-07T20:32:21.8547297Z x0 = x0.contiguous() 2025-05-07T20:32:21.8547550Z x1 = x1.contiguous() 2025-05-07T20:32:21.8547810Z 2025-05-07T20:32:21.8547998Z if scale_ub is not None: 2025-05-07T20:32:21.8548272Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.8548604Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.8557094Z ) 2025-05-07T20:32:21.8557308Z else: 2025-05-07T20:32:21.8557526Z scale_ub_tensor = None 2025-05-07T20:32:21.8557800Z 2025-05-07T20:32:21.8558052Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.8558536Z op = silu_mul_quant 2025-05-07T20:32:21.8558804Z if compiled: 2025-05-07T20:32:21.8559071Z op = torch.compile(op) 2025-05-07T20:32:21.8559377Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.8559668Z 2025-05-07T20:32:21.8559877Z y_fp8, y_scale = fn() 2025-05-07T20:32:21.8560170Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:21.8560480Z 2025-05-07T20:32:21.8560744Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.8561093Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:21.8561412Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:21.8561782Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:21.8562157Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:21.8562472Z 2025-05-07T20:32:21.8562690Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:21.8562897Z 2025-05-07T20:32:21.8563009Z moe/activation_test.py:126: 2025-05-07T20:32:21.8563316Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.8563665Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:21.8564005Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:21.8564804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:21.8565643Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:21.8566204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.8566894Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.8567726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:21.8568498Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:21.8569263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:21.8570022Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:21.8570765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:21.8571491Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:21.8572131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:21.8572705Z fn() 2025-05-07T20:32:21.8573395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:21.8573988Z self.fn.run( 2025-05-07T20:32:21.8574472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.8575088Z kernel = self.compile( 2025-05-07T20:32:21.8575634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.8576298Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.8576707Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.8576941Z 2025-05-07T20:32:21.8577165Z self = 2025-05-07T20:32:21.8578246Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.8579639Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f0508e45580>} 2025-05-07T20:32:21.8581037Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.8582069Z context = 2025-05-07T20:32:21.8582359Z 2025-05-07T20:32:21.8582540Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.8583069Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.8583548Z module_map=module_map) 2025-05-07T20:32:21.8583928Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.8584292Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:21.8584575Z E ^ 2025-05-07T20:32:21.8585048Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.8585506Z 2025-05-07T20:32:21.8585932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.8586444Z 2025-05-07T20:32:21.8586555Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.8586977Z self=, 2025-05-07T20:32:21.8587390Z T=1, 2025-05-07T20:32:21.8587585Z D=5120, 2025-05-07T20:32:21.8587850Z scale_ub=1200.0, 2025-05-07T20:32:21.8588092Z contiguous=False, 2025-05-07T20:32:21.8588325Z compiled=True, 2025-05-07T20:32:21.8588548Z ) 2025-05-07T20:32:21.9789365Z self = 2025-05-07T20:32:21.9790133Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:21.9790409Z 2025-05-07T20:32:21.9790493Z @given( 2025-05-07T20:32:21.9790758Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.9791088Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.9791398Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.9791740Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.9792080Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.9792377Z ) 2025-05-07T20:32:21.9792730Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.9793187Z def test_silu_mul_quant( 2025-05-07T20:32:21.9793693Z self, 2025-05-07T20:32:21.9793897Z T: int, 2025-05-07T20:32:21.9794111Z D: int, 2025-05-07T20:32:21.9794340Z scale_ub: Optional[float], 2025-05-07T20:32:21.9794614Z contiguous: bool, 2025-05-07T20:32:21.9794865Z compiled: bool, 2025-05-07T20:32:21.9795104Z ) -> None: 2025-05-07T20:32:21.9795324Z torch.manual_seed(2025) 2025-05-07T20:32:21.9795578Z 2025-05-07T20:32:21.9795954Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.9796302Z 2025-05-07T20:32:21.9796505Z x_sign = torch.sign(x) 2025-05-07T20:32:21.9796813Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.9797125Z x = x_sign * x_clamp 2025-05-07T20:32:21.9797375Z x0 = x[:, :D] 2025-05-07T20:32:21.9797600Z x1 = x[:, D:] 2025-05-07T20:32:21.9797812Z 2025-05-07T20:32:21.9798012Z if contiguous: 2025-05-07T20:32:21.9798259Z x0 = x0.contiguous() 2025-05-07T20:32:21.9798532Z x1 = x1.contiguous() 2025-05-07T20:32:21.9798775Z 2025-05-07T20:32:21.9798978Z if scale_ub is not None: 2025-05-07T20:32:21.9799260Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.9799598Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.9799914Z ) 2025-05-07T20:32:21.9800117Z else: 2025-05-07T20:32:21.9800335Z scale_ub_tensor = None 2025-05-07T20:32:21.9800687Z 2025-05-07T20:32:21.9800926Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.9801242Z op = silu_mul_quant 2025-05-07T20:32:21.9801504Z if compiled: 
2025-05-07T20:32:21.9801759Z op = torch.compile(op)
2025-05-07T20:32:21.9802054Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:21.9802339Z 
2025-05-07T20:32:21.9802541Z > y_fp8, y_scale = fn()
2025-05-07T20:32:21.9802710Z 
2025-05-07T20:32:21.9802829Z moe/activation_test.py:117: 
2025-05-07T20:32:21.9803130Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:21.9803476Z moe/activation_test.py:115: in fn
2025-05-07T20:32:21.9803773Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:21.9804343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:21.9804919Z return fn(*args, **kwargs)
2025-05-07T20:32:21.9805898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:21.9806610Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:21.9807151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:21.9807898Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:21.9808656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:21.9809199Z kernel = self.compile(
2025-05-07T20:32:21.9809743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:21.9810413Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:21.9818126Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:21.9818485Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:21.9818751Z E ^
2025-05-07T20:32:21.9819234Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:21.9819686Z 
2025-05-07T20:32:21.9820106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:21.9820626Z 
[The test source and the same CompilationError traceback repeat for each of the following examples:]
2025-05-07T20:32:21.9820734Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:21.9852465Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:22.2244741Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:22.3168645Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:22.3200885Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:22.3232608Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:22.4624366Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:22.5793421Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:22.5796979Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:22.5817660Z > y_fp8, y_scale = fn()
2025-05-07T20:32:22.5817830Z 
2025-05-07T20:32:22.5817949Z moe/activation_test.py:117: 
2025-05-07T20:32:22.5818257Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:22.5818610Z moe/activation_test.py:115: in fn
2025-05-07T20:32:22.5818906Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:22.5819621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:22.5820324Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:22.5820879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:22.5821579Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:22.5822261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:22.5822804Z kernel = self.compile(
2025-05-07T20:32:22.5823361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:22.5824031Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:22.5831737Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.5832102Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:22.5832374Z E ^
2025-05-07T20:32:22.5832843Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.5833347Z 2025-05-07T20:32:22.5833767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.5834290Z 2025-05-07T20:32:22.5834398Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.5834824Z self=, 2025-05-07T20:32:22.5835230Z T=4096, 2025-05-07T20:32:22.5835442Z D=5120, 2025-05-07T20:32:22.5835654Z scale_ub=1200.0, 2025-05-07T20:32:22.5835885Z contiguous=False, 2025-05-07T20:32:22.5836126Z compiled=True, 2025-05-07T20:32:22.5836345Z ) 2025-05-07T20:32:22.5836671Z self = 2025-05-07T20:32:22.5837174Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:22.5837451Z 2025-05-07T20:32:22.5837541Z @given( 2025-05-07T20:32:22.5837829Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.5838156Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.5838475Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.5838817Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.5839149Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.5839444Z ) 2025-05-07T20:32:22.5839803Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.5840256Z def test_silu_mul_quant( 2025-05-07T20:32:22.5840514Z self, 2025-05-07T20:32:22.5840730Z T: int, 2025-05-07T20:32:22.5840938Z D: int, 2025-05-07T20:32:22.5841174Z scale_ub: Optional[float], 2025-05-07T20:32:22.5841460Z contiguous: bool, 2025-05-07T20:32:22.5841711Z compiled: bool, 2025-05-07T20:32:22.5841974Z ) -> None: 2025-05-07T20:32:22.5842233Z torch.manual_seed(2025) 2025-05-07T20:32:22.5842489Z 2025-05-07T20:32:22.5842776Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.5843139Z 2025-05-07T20:32:22.5843350Z x_sign = torch.sign(x) 2025-05-07T20:32:22.5843648Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.5843971Z x = x_sign * x_clamp 2025-05-07T20:32:22.5844232Z x0 = x[:, :D] 2025-05-07T20:32:22.5844457Z x1 = x[:, D:] 2025-05-07T20:32:22.5844681Z 2025-05-07T20:32:22.5844885Z if contiguous: 2025-05-07T20:32:22.5845174Z x0 = x0.contiguous() 2025-05-07T20:32:22.5845449Z x1 = x1.contiguous() 2025-05-07T20:32:22.5845702Z 2025-05-07T20:32:22.5845902Z if scale_ub is not None: 2025-05-07T20:32:22.5846194Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.5846544Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.5846859Z ) 2025-05-07T20:32:22.5847068Z else: 2025-05-07T20:32:22.5847300Z scale_ub_tensor = None 2025-05-07T20:32:22.5847612Z 2025-05-07T20:32:22.5847850Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.5848171Z op = silu_mul_quant 2025-05-07T20:32:22.5848430Z if compiled: 2025-05-07T20:32:22.5848678Z op = torch.compile(op) 2025-05-07T20:32:22.5848981Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.5849263Z 2025-05-07T20:32:22.5849460Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.5849637Z 2025-05-07T20:32:22.5849787Z moe/activation_test.py:117: 2025-05-07T20:32:22.5850094Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.5850433Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.5850722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.5851284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.5851892Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.5852553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.5853245Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.5853788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.5854467Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.5855136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.5855678Z kernel = self.compile( 2025-05-07T20:32:22.5856225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.5856880Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.5857285Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.5857561Z 2025-05-07T20:32:22.5857778Z self = 2025-05-07T20:32:22.5858855Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.5860227Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05be074860>} 2025-05-07T20:32:22.5861551Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.5862620Z context = 2025-05-07T20:32:22.5862914Z 2025-05-07T20:32:22.5863081Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.5863601Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.5864056Z module_map=module_map) 2025-05-07T20:32:22.5864416Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.5864766Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.5865023Z E ^ 2025-05-07T20:32:22.5865529Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.5865979Z 2025-05-07T20:32:22.5866389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.5866894Z 2025-05-07T20:32:22.6737230Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.6737842Z self=, 2025-05-07T20:32:22.6738278Z T=2048, 2025-05-07T20:32:22.6738467Z D=7168, 2025-05-07T20:32:22.6738664Z scale_ub=1200.0, 2025-05-07T20:32:22.6738894Z contiguous=False, 2025-05-07T20:32:22.6739125Z compiled=False, 2025-05-07T20:32:22.6739334Z ) 2025-05-07T20:32:22.6739658Z self = 2025-05-07T20:32:22.6740160Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:22.6740435Z 2025-05-07T20:32:22.6740524Z @given( 2025-05-07T20:32:22.6741026Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.6741347Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.6741649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.6741983Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.6742315Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.6742600Z ) 2025-05-07T20:32:22.6742953Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.6743466Z def test_silu_mul_quant( 2025-05-07T20:32:22.6743715Z self, 2025-05-07T20:32:22.6743911Z T: int, 2025-05-07T20:32:22.6744113Z D: int, 2025-05-07T20:32:22.6744336Z scale_ub: Optional[float], 2025-05-07T20:32:22.6744605Z contiguous: bool, 2025-05-07T20:32:22.6744847Z compiled: bool, 2025-05-07T20:32:22.6745080Z ) -> None: 2025-05-07T20:32:22.6745297Z torch.manual_seed(2025) 2025-05-07T20:32:22.6745547Z 2025-05-07T20:32:22.6745831Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.6746179Z 2025-05-07T20:32:22.6746380Z x_sign = torch.sign(x) 2025-05-07T20:32:22.6746677Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.6746985Z x = x_sign * x_clamp 2025-05-07T20:32:22.6747234Z x0 = x[:, :D] 2025-05-07T20:32:22.6747457Z x1 = x[:, D:] 2025-05-07T20:32:22.6747788Z 2025-05-07T20:32:22.6747983Z if contiguous: 2025-05-07T20:32:22.6748220Z x0 = x0.contiguous() 2025-05-07T20:32:22.6748484Z x1 = x1.contiguous() 2025-05-07T20:32:22.6748723Z 2025-05-07T20:32:22.6748921Z if scale_ub is not None: 2025-05-07T20:32:22.6749201Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.6749536Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.6749848Z ) 2025-05-07T20:32:22.6750045Z else: 2025-05-07T20:32:22.6750260Z scale_ub_tensor = None 2025-05-07T20:32:22.6750520Z 2025-05-07T20:32:22.6750756Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.6751068Z op = silu_mul_quant 2025-05-07T20:32:22.6751323Z if compiled: 2025-05-07T20:32:22.6751574Z op = torch.compile(op) 2025-05-07T20:32:22.6751867Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.6752143Z 2025-05-07T20:32:22.6752345Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.6752507Z 2025-05-07T20:32:22.6752611Z moe/activation_test.py:117: 2025-05-07T20:32:22.6752903Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.6753239Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.6753519Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.6754277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:22.6754978Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.6755518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.6756200Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.6756854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.6757392Z kernel = self.compile( 2025-05-07T20:32:22.6757932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.6758576Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.6758970Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.6759204Z 2025-05-07T20:32:22.6759408Z self = 2025-05-07T20:32:22.6760524Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.6761901Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05be0756c0>} 2025-05-07T20:32:22.6763287Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.6764302Z context = 2025-05-07T20:32:22.6764590Z 2025-05-07T20:32:22.6764764Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.6765287Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.6765757Z module_map=module_map) 2025-05-07T20:32:22.6766133Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.6766502Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.6766756Z E ^ 2025-05-07T20:32:22.6767228Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.6767756Z 2025-05-07T20:32:22.6768223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.6768734Z 2025-05-07T20:32:22.6768848Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.6769253Z self=, 2025-05-07T20:32:22.6769661Z T=1, 2025-05-07T20:32:22.6769852Z D=7168, 2025-05-07T20:32:22.6770049Z scale_ub=None, 2025-05-07T20:32:22.6770268Z contiguous=True, 2025-05-07T20:32:22.6770503Z compiled=False, 2025-05-07T20:32:22.6770716Z ) 2025-05-07T20:32:22.6771041Z self = 2025-05-07T20:32:22.6771523Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:22.6771785Z 2025-05-07T20:32:22.6771867Z @given( 2025-05-07T20:32:22.6772100Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.6772417Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.6772728Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.6773057Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.6773384Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.6773676Z ) 2025-05-07T20:32:22.6774016Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.6774456Z def test_silu_mul_quant( 2025-05-07T20:32:22.6774708Z self, 2025-05-07T20:32:22.6774950Z T: int, 2025-05-07T20:32:22.6775161Z D: int, 2025-05-07T20:32:22.6775382Z scale_ub: Optional[float], 2025-05-07T20:32:22.6775655Z contiguous: bool, 2025-05-07T20:32:22.6775900Z compiled: bool, 2025-05-07T20:32:22.6776131Z ) -> None: 2025-05-07T20:32:22.6776341Z torch.manual_seed(2025) 2025-05-07T20:32:22.6776587Z 2025-05-07T20:32:22.6776860Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.6777214Z 2025-05-07T20:32:22.6777400Z x_sign = torch.sign(x) 2025-05-07T20:32:22.6777703Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.6778011Z x = x_sign * x_clamp 2025-05-07T20:32:22.6778254Z x0 = x[:, :D] 2025-05-07T20:32:22.6778471Z x1 = x[:, D:] 2025-05-07T20:32:22.6778685Z 2025-05-07T20:32:22.6778868Z if contiguous: 2025-05-07T20:32:22.6779104Z x0 = x0.contiguous() 2025-05-07T20:32:22.6779367Z x1 = x1.contiguous() 2025-05-07T20:32:22.6779614Z 2025-05-07T20:32:22.6779882Z if scale_ub is not None: 2025-05-07T20:32:22.6780164Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.6780493Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.6780809Z ) 2025-05-07T20:32:22.6781014Z else: 2025-05-07T20:32:22.6781229Z scale_ub_tensor = None 2025-05-07T20:32:22.6781481Z 2025-05-07T20:32:22.6781720Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.6782085Z op = silu_mul_quant 2025-05-07T20:32:22.6782339Z if compiled: 2025-05-07T20:32:22.6782590Z op = torch.compile(op) 2025-05-07T20:32:22.6782900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.6783171Z 2025-05-07T20:32:22.6783372Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.6783537Z 2025-05-07T20:32:22.6783644Z moe/activation_test.py:117: 2025-05-07T20:32:22.6783947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.6784281Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.6784573Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.6785254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.6785948Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.6786485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.6787223Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.6787876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.6788413Z kernel = self.compile( 2025-05-07T20:32:22.6788954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.6789622Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.6790016Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.6790251Z 2025-05-07T20:32:22.6790458Z self = 2025-05-07T20:32:22.6791536Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.6792959Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05be074fe0>} 2025-05-07T20:32:22.6794345Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.6795358Z context = 2025-05-07T20:32:22.6795654Z 2025-05-07T20:32:22.6795822Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.6796343Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.6796817Z module_map=module_map) 2025-05-07T20:32:22.6797180Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.6797544Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.6797806Z E ^ 2025-05-07T20:32:22.6798269Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.6798721Z 2025-05-07T20:32:22.6799133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.6799649Z 2025-05-07T20:32:22.6799755Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.6800217Z self=, 2025-05-07T20:32:22.6800613Z T=16384, 2025-05-07T20:32:22.6800818Z D=7168, 2025-05-07T20:32:22.6801014Z scale_ub=1200.0, 2025-05-07T20:32:22.6801239Z contiguous=False, 2025-05-07T20:32:22.6801467Z compiled=True, 2025-05-07T20:32:23.0268104Z ) 2025-05-07T20:32:23.0268514Z self = 2025-05-07T20:32:23.0269481Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:23.0269777Z 2025-05-07T20:32:23.0269859Z @given( 2025-05-07T20:32:23.0270097Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.0270413Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.0270717Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.0271050Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.0271396Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.0271681Z ) 2025-05-07T20:32:23.0272036Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.0272482Z def test_silu_mul_quant( 2025-05-07T20:32:23.0272723Z self, 2025-05-07T20:32:23.0272924Z T: int, 2025-05-07T20:32:23.0273125Z D: int, 2025-05-07T20:32:23.0273344Z scale_ub: Optional[float], 2025-05-07T20:32:23.0273740Z contiguous: bool, 2025-05-07T20:32:23.0273984Z compiled: bool, 2025-05-07T20:32:23.0274223Z ) -> None: 2025-05-07T20:32:23.0274443Z torch.manual_seed(2025) 2025-05-07T20:32:23.0274691Z 2025-05-07T20:32:23.0274968Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.0275310Z 2025-05-07T20:32:23.0275510Z x_sign = torch.sign(x) 2025-05-07T20:32:23.0275808Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.0276118Z x = x_sign * x_clamp 2025-05-07T20:32:23.0276369Z x0 = x[:, :D] 2025-05-07T20:32:23.0276602Z x1 = x[:, D:] 2025-05-07T20:32:23.0276811Z 2025-05-07T20:32:23.0277004Z if contiguous: 2025-05-07T20:32:23.0277245Z x0 = x0.contiguous() 2025-05-07T20:32:23.0277506Z x1 = x1.contiguous() 2025-05-07T20:32:23.0277756Z 2025-05-07T20:32:23.0277956Z if scale_ub is not None: 2025-05-07T20:32:23.0278231Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.0278579Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.0278893Z ) 2025-05-07T20:32:23.0279092Z else: 2025-05-07T20:32:23.0279299Z scale_ub_tensor = None 2025-05-07T20:32:23.0279554Z 2025-05-07T20:32:23.0279793Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.0280102Z op = silu_mul_quant 2025-05-07T20:32:23.0280354Z if compiled: 2025-05-07T20:32:23.0280686Z op = torch.compile(op) 2025-05-07T20:32:23.0280982Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.0281261Z 2025-05-07T20:32:23.0281454Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.0281617Z 2025-05-07T20:32:23.0281717Z moe/activation_test.py:117: 2025-05-07T20:32:23.0282019Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.0282353Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.0282635Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.0283197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.0283767Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.0284431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.0285133Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.0295628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.0296346Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.0297034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.0297579Z kernel = self.compile( 2025-05-07T20:32:23.0298171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.0298894Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.0299312Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.0299547Z 2025-05-07T20:32:23.0299766Z self = 2025-05-07T20:32:23.0300856Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.0302258Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05be077b00>} 2025-05-07T20:32:23.0303625Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.0304717Z context = 2025-05-07T20:32:23.0305011Z 2025-05-07T20:32:23.0305191Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.0306033Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.0306519Z module_map=module_map) 2025-05-07T20:32:23.0306903Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.0307263Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.0307537Z E ^ 2025-05-07T20:32:23.0308015Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.0308472Z 2025-05-07T20:32:23.0308899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.0309422Z 2025-05-07T20:32:23.0309531Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.0309960Z self=, 2025-05-07T20:32:23.0310375Z T=1, 2025-05-07T20:32:23.0310566Z D=7168, 2025-05-07T20:32:23.0310780Z scale_ub=None, 2025-05-07T20:32:23.0311009Z contiguous=False, 2025-05-07T20:32:23.0311253Z compiled=False, 2025-05-07T20:32:23.0311465Z ) 2025-05-07T20:32:23.0311877Z self = 2025-05-07T20:32:23.0312380Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:23.0312645Z 2025-05-07T20:32:23.0312737Z @given( 2025-05-07T20:32:23.0312976Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.0313301Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.0313619Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.0313958Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.0314306Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.0314607Z ) 2025-05-07T20:32:23.0314962Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.0315416Z def test_silu_mul_quant( 2025-05-07T20:32:23.0315671Z self, 2025-05-07T20:32:23.0315881Z T: int, 2025-05-07T20:32:23.0316086Z D: int, 2025-05-07T20:32:23.0316317Z scale_ub: Optional[float], 2025-05-07T20:32:23.0316605Z contiguous: bool, 2025-05-07T20:32:23.0316919Z compiled: bool, 2025-05-07T20:32:23.0317160Z ) -> None: 2025-05-07T20:32:23.0317389Z torch.manual_seed(2025) 2025-05-07T20:32:23.0317637Z 2025-05-07T20:32:23.0317923Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.0318282Z 2025-05-07T20:32:23.0318483Z x_sign = torch.sign(x) 2025-05-07T20:32:23.0318790Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.0319195Z x = x_sign * x_clamp 2025-05-07T20:32:23.0319444Z x0 = x[:, :D] 2025-05-07T20:32:23.0319675Z x1 = x[:, D:] 2025-05-07T20:32:23.0319897Z 2025-05-07T20:32:23.0320089Z if contiguous: 2025-05-07T20:32:23.0320336Z x0 = x0.contiguous() 2025-05-07T20:32:23.0320608Z x1 = x1.contiguous() 2025-05-07T20:32:23.0320856Z 2025-05-07T20:32:23.0321062Z if scale_ub is not None: 2025-05-07T20:32:23.0321353Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.0321713Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.0322066Z ) 2025-05-07T20:32:23.0322274Z else: 2025-05-07T20:32:23.0322500Z scale_ub_tensor = None 2025-05-07T20:32:23.0322754Z 2025-05-07T20:32:23.0322998Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.0323326Z op = silu_mul_quant 2025-05-07T20:32:23.0323586Z if compiled: 2025-05-07T20:32:23.0323929Z op = torch.compile(op) 2025-05-07T20:32:23.0324240Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.0324520Z 2025-05-07T20:32:23.0324726Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.0324895Z 2025-05-07T20:32:23.0325005Z moe/activation_test.py:117: 2025-05-07T20:32:23.0325305Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.0325652Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.0325955Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.0326663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.0327362Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.0328045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.0328750Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.0329432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.0329974Z kernel = self.compile( 2025-05-07T20:32:23.0330526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.0331195Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.0331649Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.0331922Z 2025-05-07T20:32:23.0332158Z self = 2025-05-07T20:32:23.0333248Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.0334630Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0509070680>} 2025-05-07T20:32:23.0335974Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.0337000Z context = 2025-05-07T20:32:23.0337299Z 2025-05-07T20:32:23.0337512Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.0338042Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.0338518Z module_map=module_map) 2025-05-07T20:32:23.0338883Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.0339247Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.0339565Z E ^ 2025-05-07T20:32:23.0340028Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.0340485Z 2025-05-07T20:32:23.0340901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.0341418Z 2025-05-07T20:32:23.0341525Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.0341945Z self=, 2025-05-07T20:32:23.0342348Z T=2048, 2025-05-07T20:32:23.0342543Z D=7168, 2025-05-07T20:32:23.0342737Z scale_ub=None, 2025-05-07T20:32:23.0342948Z contiguous=False, 2025-05-07T20:32:23.0343175Z compiled=True, 2025-05-07T20:32:23.0343379Z ) 2025-05-07T20:32:23.1014258Z self = 2025-05-07T20:32:23.1014854Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:23.1015507Z 2025-05-07T20:32:23.1015593Z @given( 2025-05-07T20:32:23.1015839Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.1016164Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.1016475Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.1016815Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.1017153Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.1017452Z ) 2025-05-07T20:32:23.1017812Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.1018261Z def test_silu_mul_quant( 2025-05-07T20:32:23.1018513Z self, 2025-05-07T20:32:23.1018713Z T: int, 2025-05-07T20:32:23.1018922Z D: int, 2025-05-07T20:32:23.1019156Z scale_ub: Optional[float], 2025-05-07T20:32:23.1019431Z contiguous: bool, 2025-05-07T20:32:23.1019686Z compiled: bool, 2025-05-07T20:32:23.1019928Z ) -> None: 2025-05-07T20:32:23.1020159Z torch.manual_seed(2025) 2025-05-07T20:32:23.1020413Z 2025-05-07T20:32:23.1020700Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.1021056Z 2025-05-07T20:32:23.1021255Z x_sign = torch.sign(x) 2025-05-07T20:32:23.1021561Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.1021926Z x = x_sign * x_clamp 2025-05-07T20:32:23.1022180Z x0 = x[:, :D] 2025-05-07T20:32:23.1022413Z x1 = x[:, D:] 2025-05-07T20:32:23.1022725Z 2025-05-07T20:32:23.1022920Z if contiguous: 2025-05-07T20:32:23.1023162Z x0 = x0.contiguous() 2025-05-07T20:32:23.1023427Z x1 = x1.contiguous() 2025-05-07T20:32:23.1023678Z 2025-05-07T20:32:23.1023875Z if scale_ub is not None: 2025-05-07T20:32:23.1024155Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.1024497Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.1024815Z ) 2025-05-07T20:32:23.1025017Z else: 2025-05-07T20:32:23.1025239Z scale_ub_tensor = None 2025-05-07T20:32:23.1025494Z 2025-05-07T20:32:23.1025734Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.1026056Z op = silu_mul_quant 2025-05-07T20:32:23.1026312Z if compiled: 2025-05-07T20:32:23.1026570Z op = torch.compile(op) 2025-05-07T20:32:23.1026875Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.1027154Z 2025-05-07T20:32:23.1027443Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.1027620Z 2025-05-07T20:32:23.1027724Z moe/activation_test.py:117: 2025-05-07T20:32:23.1028028Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.1028364Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.1028654Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.1029221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.1029857Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.1030524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.1031220Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.1031764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.1032452Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.1033120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.1033661Z kernel = self.compile( 2025-05-07T20:32:23.1034208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.1034873Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.1035360Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.1035591Z 2025-05-07T20:32:23.1035807Z self = 2025-05-07T20:32:23.1036883Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.1038274Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0509071d00>} 2025-05-07T20:32:23.1039618Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.1040645Z context = 2025-05-07T20:32:23.1040938Z 2025-05-07T20:32:23.1041114Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.1041638Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.1042114Z module_map=module_map) 2025-05-07T20:32:23.1042491Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.1042852Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.1043167Z E ^ 2025-05-07T20:32:23.1043647Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.1044103Z 2025-05-07T20:32:23.1044528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.1045040Z 2025-05-07T20:32:23.1045147Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.1045578Z self=, 2025-05-07T20:32:23.1045993Z T=4096, 2025-05-07T20:32:23.1046187Z D=7168, 2025-05-07T20:32:23.1046389Z scale_ub=None, 2025-05-07T20:32:23.1046615Z contiguous=False, 2025-05-07T20:32:23.1046844Z compiled=True, 2025-05-07T20:32:23.1047063Z ) 2025-05-07T20:32:23.1047387Z self = 2025-05-07T20:32:23.1048011Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:23.1048283Z 2025-05-07T20:32:23.1048417Z @given( 2025-05-07T20:32:23.1048660Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.1048988Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.1049296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.1049634Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.1049972Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.1050312Z ) 2025-05-07T20:32:23.1050669Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.1051117Z def test_silu_mul_quant( 2025-05-07T20:32:23.1051365Z self, 2025-05-07T20:32:23.1051567Z T: int, 2025-05-07T20:32:23.1051773Z D: int, 2025-05-07T20:32:23.1052000Z scale_ub: Optional[float], 2025-05-07T20:32:23.1052275Z contiguous: bool, 2025-05-07T20:32:23.1052525Z compiled: bool, 2025-05-07T20:32:23.1052759Z ) -> None: 2025-05-07T20:32:23.1052989Z torch.manual_seed(2025) 2025-05-07T20:32:23.1053243Z 2025-05-07T20:32:23.1053525Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.1053870Z 2025-05-07T20:32:23.1054072Z x_sign = torch.sign(x) 2025-05-07T20:32:23.1054373Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.1054686Z x = x_sign * x_clamp 2025-05-07T20:32:23.1054941Z x0 = x[:, :D] 2025-05-07T20:32:23.1055223Z x1 = x[:, D:] 2025-05-07T20:32:23.1055438Z 2025-05-07T20:32:23.1055637Z if contiguous: 2025-05-07T20:32:23.1055880Z x0 = x0.contiguous() 2025-05-07T20:32:23.1056141Z x1 = x1.contiguous() 2025-05-07T20:32:23.1056391Z 2025-05-07T20:32:23.1056597Z if scale_ub is not None: 2025-05-07T20:32:23.1056874Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.1057219Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.1057541Z ) 2025-05-07T20:32:23.1057750Z else: 2025-05-07T20:32:23.1057969Z scale_ub_tensor = None 2025-05-07T20:32:23.1058230Z 2025-05-07T20:32:23.1058475Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.1058793Z op = silu_mul_quant 2025-05-07T20:32:23.1059055Z if compiled: 2025-05-07T20:32:23.1059310Z op = torch.compile(op) 2025-05-07T20:32:23.1059608Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.1059899Z 2025-05-07T20:32:23.1060108Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.1060275Z 2025-05-07T20:32:23.1060378Z moe/activation_test.py:117: 2025-05-07T20:32:23.1060682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.1061022Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.1061315Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.1061925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.1062496Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.1063153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.1063847Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.1064390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.1065085Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.1065758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.1066292Z kernel = self.compile( 2025-05-07T20:32:23.1066842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.1067507Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.1067948Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.1068187Z 2025-05-07T20:32:23.1068394Z self = 2025-05-07T20:32:23.1069479Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.1070892Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0509072840>} 2025-05-07T20:32:23.1072226Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.1073246Z context = 2025-05-07T20:32:23.1073537Z 2025-05-07T20:32:23.1073701Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.1074222Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.1074688Z module_map=module_map) 2025-05-07T20:32:23.1075048Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.1075451Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.1075719Z E ^ 2025-05-07T20:32:23.1076181Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.1076634Z 2025-05-07T20:32:23.1077047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.1077560Z 2025-05-07T20:32:23.2334354Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.2334989Z self=, 2025-05-07T20:32:23.2335564Z T=16384, 2025-05-07T20:32:23.2335816Z D=5120, 2025-05-07T20:32:23.2336086Z scale_ub=1200.0, 2025-05-07T20:32:23.2336325Z contiguous=False, 2025-05-07T20:32:23.2336550Z compiled=False, 2025-05-07T20:32:23.2336767Z ) 2025-05-07T20:32:23.2337091Z self = 2025-05-07T20:32:23.2337607Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:23.2337898Z 2025-05-07T20:32:23.2337981Z @given( 2025-05-07T20:32:23.2338220Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.2338539Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.2338848Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.2339184Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.2339516Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.2340072Z ) 2025-05-07T20:32:23.2340430Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.2340874Z def test_silu_mul_quant( 2025-05-07T20:32:23.2341122Z self, 2025-05-07T20:32:23.2341314Z T: int, 2025-05-07T20:32:23.2341524Z D: int, 2025-05-07T20:32:23.2341752Z scale_ub: Optional[float], 2025-05-07T20:32:23.2342081Z contiguous: bool, 2025-05-07T20:32:23.2342325Z compiled: bool, 2025-05-07T20:32:23.2342559Z ) -> None: 2025-05-07T20:32:23.2342785Z torch.manual_seed(2025) 2025-05-07T20:32:23.2343026Z 2025-05-07T20:32:23.2343302Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.2343652Z 2025-05-07T20:32:23.2343848Z x_sign = torch.sign(x) 2025-05-07T20:32:23.2344147Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.2344464Z x = x_sign * x_clamp 2025-05-07T20:32:23.2344704Z x0 = x[:, :D] 2025-05-07T20:32:23.2345006Z x1 = x[:, D:] 2025-05-07T20:32:23.2345228Z 2025-05-07T20:32:23.2345417Z if contiguous: 2025-05-07T20:32:23.2345654Z x0 = x0.contiguous() 2025-05-07T20:32:23.2345920Z x1 = x1.contiguous() 2025-05-07T20:32:23.2346166Z 2025-05-07T20:32:23.2346365Z if scale_ub is not None: 2025-05-07T20:32:23.2346642Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.2347047Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.2347364Z ) 2025-05-07T20:32:23.2347567Z else: 2025-05-07T20:32:23.2347786Z scale_ub_tensor = None 2025-05-07T20:32:23.2348041Z 2025-05-07T20:32:23.2348282Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.2348601Z op = silu_mul_quant 2025-05-07T20:32:23.2348853Z if compiled: 2025-05-07T20:32:23.2349110Z op = torch.compile(op) 2025-05-07T20:32:23.2349415Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.2349697Z 2025-05-07T20:32:23.2349905Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.2350069Z 2025-05-07T20:32:23.2350176Z moe/activation_test.py:117: 2025-05-07T20:32:23.2350472Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.2350814Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.2351104Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.2351893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:23.2352584Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.2353128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.2353813Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.2354475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.2355011Z kernel = self.compile( 2025-05-07T20:32:23.2355554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.2356212Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.2356607Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.2356846Z 2025-05-07T20:32:23.2357053Z self = 2025-05-07T20:32:23.2358132Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.2359567Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0508758040>} 2025-05-07T20:32:23.2360900Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.2361927Z context = 2025-05-07T20:32:23.2362223Z 2025-05-07T20:32:23.2362390Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.2362918Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.2363384Z module_map=module_map) 2025-05-07T20:32:23.2363755Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.2364116Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.2364384Z E ^ 2025-05-07T20:32:23.2364895Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.2365348Z 2025-05-07T20:32:23.2365765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.2366277Z 2025-05-07T20:32:23.2366388Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.2366805Z self=, 2025-05-07T20:32:23.2367210Z T=16384, 2025-05-07T20:32:23.2367487Z D=5120, 2025-05-07T20:32:23.2367804Z scale_ub=1200.0, 2025-05-07T20:32:23.2368025Z contiguous=True, 2025-05-07T20:32:23.2368252Z compiled=True, 2025-05-07T20:32:23.2368466Z ) 2025-05-07T20:32:23.2368780Z self = 2025-05-07T20:32:23.2369275Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:23.2369550Z 2025-05-07T20:32:23.2369640Z @given( 2025-05-07T20:32:23.2369876Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.2370200Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.2370511Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.2370847Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.2371175Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.2371471Z ) 2025-05-07T20:32:23.2371870Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.2372401Z def test_silu_mul_quant( 2025-05-07T20:32:23.2372644Z self, 2025-05-07T20:32:23.2372849Z T: int, 2025-05-07T20:32:23.2373055Z D: int, 2025-05-07T20:32:23.2373274Z scale_ub: Optional[float], 2025-05-07T20:32:23.2373555Z contiguous: bool, 2025-05-07T20:32:23.2373802Z compiled: bool, 2025-05-07T20:32:23.2374030Z ) -> None: 2025-05-07T20:32:23.2374256Z torch.manual_seed(2025) 2025-05-07T20:32:23.2374508Z 2025-05-07T20:32:23.2374784Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.2375143Z 2025-05-07T20:32:23.2382912Z x_sign = torch.sign(x) 2025-05-07T20:32:23.2383235Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.2383559Z x = x_sign * x_clamp 2025-05-07T20:32:23.2383815Z x0 = x[:, :D] 2025-05-07T20:32:23.2384050Z x1 = x[:, D:] 2025-05-07T20:32:23.2384266Z 2025-05-07T20:32:23.2384476Z if contiguous: 2025-05-07T20:32:23.2384729Z x0 = x0.contiguous() 2025-05-07T20:32:23.2384993Z x1 = x1.contiguous() 2025-05-07T20:32:23.2385244Z 2025-05-07T20:32:23.2385452Z if scale_ub is not None: 2025-05-07T20:32:23.2385730Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.2386081Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.2386401Z ) 2025-05-07T20:32:23.2386608Z else: 2025-05-07T20:32:23.2386906Z scale_ub_tensor = None 2025-05-07T20:32:23.2387173Z 2025-05-07T20:32:23.2387420Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.2387748Z op = silu_mul_quant 2025-05-07T20:32:23.2388012Z if compiled: 2025-05-07T20:32:23.2388276Z op = torch.compile(op) 2025-05-07T20:32:23.2388578Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.2388867Z 2025-05-07T20:32:23.2389074Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.2389248Z 2025-05-07T20:32:23.2389355Z moe/activation_test.py:117: 2025-05-07T20:32:23.2389667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.2390014Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.2390297Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.2390871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.2391446Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.2392221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.2392918Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.2393466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.2394160Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.2394881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.2395417Z kernel = self.compile( 2025-05-07T20:32:23.2395970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.2396637Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.2397039Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.2397283Z 2025-05-07T20:32:23.2397497Z self = 2025-05-07T20:32:23.2398585Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.2399966Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0508759300>} 2025-05-07T20:32:23.2401364Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.2402443Z context = 2025-05-07T20:32:23.2402741Z 2025-05-07T20:32:23.2402916Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.2403450Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.2403930Z module_map=module_map) 2025-05-07T20:32:23.2404301Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.2404672Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.2404947Z E ^ 2025-05-07T20:32:23.2405422Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.2406184Z 2025-05-07T20:32:23.2406606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.2407128Z 2025-05-07T20:32:23.5437871Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.5439071Z self=, 2025-05-07T20:32:23.5440235Z T=16384, 2025-05-07T20:32:23.5441050Z D=5120, 2025-05-07T20:32:23.5441455Z scale_ub=None, 2025-05-07T20:32:23.5441864Z contiguous=False, 2025-05-07T20:32:23.5442096Z compiled=True, 2025-05-07T20:32:23.5442316Z ) 2025-05-07T20:32:23.5442646Z self = 2025-05-07T20:32:23.5443151Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:23.5443443Z 2025-05-07T20:32:23.5443539Z @given( 2025-05-07T20:32:23.5443782Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.5444106Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.5444417Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.5444755Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.5445091Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.5445378Z ) 2025-05-07T20:32:23.5445738Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.5446264Z def test_silu_mul_quant( 2025-05-07T20:32:23.5446513Z self, 2025-05-07T20:32:23.5446720Z T: int, 2025-05-07T20:32:23.5446926Z D: int, 2025-05-07T20:32:23.5447143Z scale_ub: Optional[float], 2025-05-07T20:32:23.5447423Z contiguous: bool, 2025-05-07T20:32:23.5447776Z compiled: bool, 2025-05-07T20:32:23.5448006Z ) -> None: 2025-05-07T20:32:23.5448232Z torch.manual_seed(2025) 2025-05-07T20:32:23.5448571Z 2025-05-07T20:32:23.5448848Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.5449200Z 2025-05-07T20:32:23.5449401Z x_sign = torch.sign(x) 2025-05-07T20:32:23.5449700Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.5450012Z x = x_sign * x_clamp 2025-05-07T20:32:23.5450260Z x0 = x[:, :D] 2025-05-07T20:32:23.5450481Z x1 = x[:, D:] 2025-05-07T20:32:23.5450687Z 2025-05-07T20:32:23.5450885Z if contiguous: 2025-05-07T20:32:23.5451129Z x0 = x0.contiguous() 2025-05-07T20:32:23.5451390Z x1 = x1.contiguous() 2025-05-07T20:32:23.5451637Z 2025-05-07T20:32:23.5451834Z if scale_ub is not None: 2025-05-07T20:32:23.5452110Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.5452461Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.5452790Z ) 2025-05-07T20:32:23.5452995Z else: 2025-05-07T20:32:23.5453309Z scale_ub_tensor = None 2025-05-07T20:32:23.5453572Z 2025-05-07T20:32:23.5453805Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.5454130Z op = silu_mul_quant 2025-05-07T20:32:23.5454391Z if compiled: 2025-05-07T20:32:23.5454654Z op = torch.compile(op) 2025-05-07T20:32:23.5454954Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.5455238Z 2025-05-07T20:32:23.5455440Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.5455608Z 2025-05-07T20:32:23.5455714Z moe/activation_test.py:117: 2025-05-07T20:32:23.5456017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.5456360Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.5456647Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.5457215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.5457794Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.5458464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.5459157Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.5459703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.5460393Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.5461114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.5461657Z kernel = self.compile( 2025-05-07T20:32:23.5462201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.5462865Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.5463273Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.5463508Z 2025-05-07T20:32:23.5463723Z self = 2025-05-07T20:32:23.5464800Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.5466244Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0508759e40>} 2025-05-07T20:32:23.5467592Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.5468620Z context = 2025-05-07T20:32:23.5468952Z 2025-05-07T20:32:23.5469119Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.5469641Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.5470110Z module_map=module_map) 2025-05-07T20:32:23.5470477Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.5470828Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.5471092Z E ^ 2025-05-07T20:32:23.5471564Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:23.5472488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
The next ten Hypothesis examples all failed with this identical CompilationError inside the Triton compiler; their repeated test source and tracebacks are elided here, keeping only each example's parameters:
2025-05-07T20:32:23.5473103Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:23.6222530Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:23.7582359Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:23.7624188Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:24.0172424Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:24.1160257Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:24.1192161Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:24.2563053Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:24.2604909Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:24.3360819Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
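This failure pattern looks like an architecture mismatch rather than test flakiness: the job ran on a linux.g5.4xlarge.nvidia.gpu runner, whose NVIDIA A10G GPU is compute capability 8.6, and Triton generally exposes the fp8e4nv (e4m3) dtype only on newer GPUs (SM 8.9 Ada and SM 9.0 Hopper), which is why the error lists only 'fp8e4b15' and 'fp8e5' as supported here. A minimal guard sketch follows; the helper name, the test-class name, and the (8, 9) threshold are illustrative assumptions, not part of the logged test.

```python
# A minimal sketch, assuming standard PyTorch CUDA APIs; supports_fp8e4nv
# and SiluMulQuantTest are hypothetical names for illustration only.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv (e4m3) Triton codegen is generally limited to SM 8.9+ (Ada)
    # and SM 9.0 (Hopper); the A10G in this log is SM 8.6.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Usage sketch: skip the whole test case on unsupported GPUs instead of
# letting every Hypothesis example fail inside the Triton compiler.
@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9 or newer")
class SiluMulQuantTest(unittest.TestCase):
    ...
```

Skipping at the class level would keep Hypothesis from re-deriving the same compiler failure for every generated example, as seen above.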
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.3392188Z 2025-05-07T20:32:24.3392604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.3393119Z 2025-05-07T20:32:24.4172707Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.4173344Z self=, 2025-05-07T20:32:24.4173972Z T=16384, 2025-05-07T20:32:24.4174180Z D=5120, 2025-05-07T20:32:24.4174372Z scale_ub=None, 2025-05-07T20:32:24.4174588Z contiguous=False, 2025-05-07T20:32:24.4174816Z compiled=False, 2025-05-07T20:32:24.4175025Z ) 2025-05-07T20:32:24.4175356Z self = 2025-05-07T20:32:24.4175862Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:24.4176145Z 2025-05-07T20:32:24.4176229Z @given( 2025-05-07T20:32:24.4176588Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.4176912Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.4177213Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.4177552Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.4177888Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.4178182Z ) 2025-05-07T20:32:24.4178523Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.4178964Z def test_silu_mul_quant( 2025-05-07T20:32:24.4179213Z self, 2025-05-07T20:32:24.4179409Z T: int, 2025-05-07T20:32:24.4179614Z D: int, 2025-05-07T20:32:24.4179841Z scale_ub: Optional[float], 2025-05-07T20:32:24.4180112Z contiguous: bool, 2025-05-07T20:32:24.4180366Z compiled: bool, 2025-05-07T20:32:24.4180598Z ) -> None: 2025-05-07T20:32:24.4180816Z torch.manual_seed(2025) 2025-05-07T20:32:24.4181063Z 2025-05-07T20:32:24.4181419Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.4181764Z 2025-05-07T20:32:24.4181989Z x_sign = torch.sign(x) 2025-05-07T20:32:24.4182285Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.4184317Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
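The CompilationError above comes from Triton lowering _fbgemm_silu_mul_quant: the kernel asks for the fp8e4nv (e4m3) dtype, which Triton only implements on GPUs with compute capability 8.9 or newer, while this device only exposes fp8e4b15 and fp8e5. A minimal guard, assuming a pytest-style runner and that skipping (rather than falling back to another fp8 flavor) is acceptable here:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (e4m3) lowering requires sm_89+ (Ada/Hopper parts).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical marker for tests that exercise the fp8e4nv path:
    requires_fp8e4nv = pytest.mark.skipif(
        not supports_fp8e4nv(), reason="fp8e4nv requires compute capability >= 8.9"
    )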
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.4186268Z 2025-05-07T20:32:24.4186391Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:24.4186612Z 2025-05-07T20:32:24.4186725Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.4187148Z self=, 2025-05-07T20:32:24.4187549Z T=4096, 2025-05-07T20:32:24.4187750Z D=7168, 2025-05-07T20:32:24.4187951Z scale_ub=1200.0, 2025-05-07T20:32:24.4188180Z contiguous=True, 2025-05-07T20:32:24.4188409Z compiled=True, 2025-05-07T20:32:24.4188617Z ) 2025-05-07T20:32:24.4188937Z self = 2025-05-07T20:32:24.4189511Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:24.4189781Z 2025-05-07T20:32:24.4189871Z @given( 2025-05-07T20:32:24.4190110Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.4190419Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.4190732Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.4191069Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.4191399Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.4191690Z ) 2025-05-07T20:32:24.4192049Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.4192545Z def test_silu_mul_quant( 2025-05-07T20:32:24.4192787Z self, 2025-05-07T20:32:24.4192989Z T: int, 2025-05-07T20:32:24.4193194Z D: int, 2025-05-07T20:32:24.4193410Z scale_ub: Optional[float], 2025-05-07T20:32:24.4193692Z contiguous: bool, 2025-05-07T20:32:24.4193938Z compiled: bool, 2025-05-07T20:32:24.4194160Z ) -> None: 2025-05-07T20:32:24.4194385Z torch.manual_seed(2025) 2025-05-07T20:32:24.4194631Z 2025-05-07T20:32:24.4194903Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.4195255Z 2025-05-07T20:32:24.4195462Z x_sign = torch.sign(x) 2025-05-07T20:32:24.4195754Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.4197799Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
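The allocator hint in the OutOfMemoryError message is easy to apply too late: PYTORCH_CUDA_ALLOC_CONF is read when the caching allocator initializes, so it has to be set before the process makes its first CUDA allocation. A sketch of the two usual places to set it (the exact invocation depends on how this job launches pytest):

    # In the shell that starts the suite, not inside the test module:
    #   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m pytest moe/activation_test.py
    #
    # Or at the very top of a conftest.py, before torch touches the GPU:
    import os
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")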
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.4199678Z 2025-05-07T20:32:24.4199802Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:24.4200022Z 2025-05-07T20:32:24.4200128Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.4200545Z self=, 2025-05-07T20:32:24.4200950Z T=16384, 2025-05-07T20:32:24.4201150Z D=7168, 2025-05-07T20:32:24.4201351Z scale_ub=None, 2025-05-07T20:32:24.4201613Z contiguous=False, 2025-05-07T20:32:24.4201851Z compiled=False, 2025-05-07T20:32:24.4202070Z ) 2025-05-07T20:32:24.4202390Z self = 2025-05-07T20:32:24.4202890Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:24.4203174Z 2025-05-07T20:32:24.4203256Z @given( 2025-05-07T20:32:24.4203495Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.4203854Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.4204167Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.4204499Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.4204827Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.4205118Z ) 2025-05-07T20:32:24.4205473Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.4206216Z def test_silu_mul_quant( 2025-05-07T20:32:24.4206487Z self, 2025-05-07T20:32:24.4206699Z T: int, 2025-05-07T20:32:24.4206906Z D: int, 2025-05-07T20:32:24.4207130Z scale_ub: Optional[float], 2025-05-07T20:32:24.4207406Z contiguous: bool, 2025-05-07T20:32:24.4207730Z compiled: bool, 2025-05-07T20:32:24.4207950Z ) -> None: 2025-05-07T20:32:24.4208173Z torch.manual_seed(2025) 2025-05-07T20:32:24.4208421Z 2025-05-07T20:32:24.4208692Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.4210836Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.4212700Z 2025-05-07T20:32:24.4212821Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.4213032Z 2025-05-07T20:32:24.4213145Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.4213565Z self=, 2025-05-07T20:32:24.4213968Z T=2048, 2025-05-07T20:32:24.4214171Z D=7168, 2025-05-07T20:32:24.4214370Z scale_ub=1200.0, 2025-05-07T20:32:24.4214593Z contiguous=True, 2025-05-07T20:32:24.4214820Z compiled=True, 2025-05-07T20:32:24.4215031Z ) 2025-05-07T20:32:24.4215348Z self = 2025-05-07T20:32:24.4215841Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:24.4216111Z 2025-05-07T20:32:24.4216198Z @given( 2025-05-07T20:32:24.4216492Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.4216817Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.4217125Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.4217458Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.4217786Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.4218078Z ) 2025-05-07T20:32:24.4218431Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.4218874Z def test_silu_mul_quant( 2025-05-07T20:32:24.4219121Z self, 2025-05-07T20:32:24.4219324Z T: int, 2025-05-07T20:32:24.4219521Z D: int, 2025-05-07T20:32:24.4219747Z scale_ub: Optional[float], 2025-05-07T20:32:24.4220026Z contiguous: bool, 2025-05-07T20:32:24.4220267Z compiled: bool, 2025-05-07T20:32:24.4220500Z ) -> None: 2025-05-07T20:32:24.4220727Z torch.manual_seed(2025) 2025-05-07T20:32:24.4220968Z 2025-05-07T20:32:24.4221313Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.4221668Z 2025-05-07T20:32:24.4221863Z x_sign = torch.sign(x) 2025-05-07T20:32:24.4222162Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.4224152Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
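The "Tried to allocate" figures line up exactly with the bf16 input the test builds: x has shape [T, 2*D] at 2 bytes per element, so T=16384, D=7168 gives 16384 × 14336 × 2 B = 448 MiB, and T=2048, D=7168 gives 2048 × 14336 × 2 B = 56 MiB. The failures are ordinary allocation pressure on an almost-full device, not a miscomputed shape.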
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.4226059Z 2025-05-07T20:32:24.4226178Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:24.4226393Z 2025-05-07T20:32:24.4226504Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.4226936Z self=, 2025-05-07T20:32:24.4227334Z T=2048, 2025-05-07T20:32:24.4227533Z D=7168, 2025-05-07T20:32:24.4227739Z scale_ub=None, 2025-05-07T20:32:24.4227955Z contiguous=True, 2025-05-07T20:32:24.4228189Z compiled=False, 2025-05-07T20:32:24.4235919Z ) 2025-05-07T20:32:24.5104898Z self = 2025-05-07T20:32:24.5105575Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:24.5106137Z 2025-05-07T20:32:24.5106248Z @given( 2025-05-07T20:32:24.5106571Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.5106968Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.5107280Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.5107612Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.5107954Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.5108277Z ) 2025-05-07T20:32:24.5108635Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.5109072Z def test_silu_mul_quant( 2025-05-07T20:32:24.5109322Z self, 2025-05-07T20:32:24.5109524Z T: int, 2025-05-07T20:32:24.5109722Z D: int, 2025-05-07T20:32:24.5109952Z scale_ub: Optional[float], 2025-05-07T20:32:24.5110236Z contiguous: bool, 2025-05-07T20:32:24.5110488Z compiled: bool, 2025-05-07T20:32:24.5110716Z ) -> None: 2025-05-07T20:32:24.5110940Z torch.manual_seed(2025) 2025-05-07T20:32:24.5111191Z 2025-05-07T20:32:24.5111464Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.5111819Z 2025-05-07T20:32:24.5112022Z > x_sign = torch.sign(x) 2025-05-07T20:32:24.5114074Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
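Note the pattern across examples: the memory reported as allocated by PyTorch stays pinned near the 22.07 GiB capacity (between 21.5 and 21.73 GiB) even though each example only creates fresh local tensors, which suggests blocks from earlier examples are still being held between Hypothesis examples. A hedged cleanup sketch; since an autouse pytest fixture runs once per test function rather than once per Hypothesis example, per-example cleanup would need the same two calls at the end of the test body instead:

    import gc
    import torch
    import pytest

    @pytest.fixture(autouse=True)
    def release_cuda_memory():
        # Hypothetical teardown: drop dead references, then return cached
        # blocks so the next test starts from a cleaner device.
        yield
        gc.collect()
        torch.cuda.empty_cache()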
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.5115957Z 2025-05-07T20:32:24.5116078Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:24.5116299Z 2025-05-07T20:32:24.5116405Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.5116828Z self=, 2025-05-07T20:32:24.5117237Z T=1, 2025-05-07T20:32:24.5117426Z D=7168, 2025-05-07T20:32:24.5117631Z scale_ub=1200.0, 2025-05-07T20:32:24.5117861Z contiguous=True, 2025-05-07T20:32:24.5118084Z compiled=False, 2025-05-07T20:32:24.5118303Z ) 2025-05-07T20:32:24.5118733Z self = 2025-05-07T20:32:24.5119226Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:24.5119499Z 2025-05-07T20:32:24.5119578Z @given( 2025-05-07T20:32:24.5119816Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.5120135Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.5120504Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.5120845Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.5121181Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.5121468Z ) 2025-05-07T20:32:24.5121828Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.5122278Z def test_silu_mul_quant( 2025-05-07T20:32:24.5122526Z self, 2025-05-07T20:32:24.5122732Z T: int, 2025-05-07T20:32:24.5122938Z D: int, 2025-05-07T20:32:24.5123163Z scale_ub: Optional[float], 2025-05-07T20:32:24.5123443Z contiguous: bool, 2025-05-07T20:32:24.5123695Z compiled: bool, 2025-05-07T20:32:24.5123919Z ) -> None: 2025-05-07T20:32:24.5124143Z torch.manual_seed(2025) 2025-05-07T20:32:24.5124396Z 2025-05-07T20:32:24.5124667Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.5125021Z 2025-05-07T20:32:24.5125225Z x_sign = torch.sign(x) 2025-05-07T20:32:24.5125587Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.5125903Z x = x_sign * x_clamp 2025-05-07T20:32:24.5126150Z x0 = x[:, :D] 2025-05-07T20:32:24.5126374Z x1 = x[:, D:] 2025-05-07T20:32:24.5126583Z 2025-05-07T20:32:24.5126775Z if contiguous: 2025-05-07T20:32:24.5127024Z x0 = x0.contiguous() 2025-05-07T20:32:24.5127282Z x1 = x1.contiguous() 2025-05-07T20:32:24.5127639Z 2025-05-07T20:32:24.5127855Z if scale_ub is not None: 2025-05-07T20:32:24.5128124Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.5128463Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.5128776Z ) 2025-05-07T20:32:24.5128969Z else: 2025-05-07T20:32:24.5129188Z scale_ub_tensor = None 2025-05-07T20:32:24.5129446Z 2025-05-07T20:32:24.5129675Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.5130005Z op = silu_mul_quant 2025-05-07T20:32:24.5130261Z if compiled: 2025-05-07T20:32:24.5130517Z op = torch.compile(op) 2025-05-07T20:32:24.5130816Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.5131101Z 2025-05-07T20:32:24.5131306Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.5131472Z 2025-05-07T20:32:24.5131576Z moe/activation_test.py:117: 2025-05-07T20:32:24.5131876Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.5132266Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.5132550Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.5133256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.5133959Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.5134507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.5135201Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.5135873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.5136412Z kernel = self.compile( 2025-05-07T20:32:24.5136956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.5137623Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.5138068Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.5138301Z 2025-05-07T20:32:24.5138515Z self = 2025-05-07T20:32:24.5139600Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.5141028Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f050831c680>} 2025-05-07T20:32:24.5142383Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.5143559Z context = 2025-05-07T20:32:24.5143851Z 2025-05-07T20:32:24.5144025Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.5144547Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.5145019Z module_map=module_map) 2025-05-07T20:32:24.5145388Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.5145801Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.5146069Z E ^ 2025-05-07T20:32:24.5146537Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.5146987Z 2025-05-07T20:32:24.5147408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.5147917Z 2025-05-07T20:32:24.5148024Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.5148445Z self=, 2025-05-07T20:32:24.5148851Z T=128, 2025-05-07T20:32:24.5149042Z D=5120, 2025-05-07T20:32:24.5149242Z scale_ub=None, 2025-05-07T20:32:24.5149453Z contiguous=True, 2025-05-07T20:32:24.5149676Z compiled=False, 2025-05-07T20:32:24.5149885Z ) 2025-05-07T20:32:24.7384230Z self = 2025-05-07T20:32:24.7384761Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:24.7385063Z 2025-05-07T20:32:24.7385173Z @given( 2025-05-07T20:32:24.7385519Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.7385849Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.7386164Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.7386507Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.7386945Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.7387369Z ) 2025-05-07T20:32:24.7387731Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.7388174Z def test_silu_mul_quant( 2025-05-07T20:32:24.7388425Z self, 2025-05-07T20:32:24.7388630Z T: int, 2025-05-07T20:32:24.7388824Z D: int, 2025-05-07T20:32:24.7389050Z scale_ub: Optional[float], 2025-05-07T20:32:24.7389325Z contiguous: bool, 2025-05-07T20:32:24.7389573Z compiled: bool, 2025-05-07T20:32:24.7389803Z ) -> None: 2025-05-07T20:32:24.7390024Z torch.manual_seed(2025) 2025-05-07T20:32:24.7390268Z 2025-05-07T20:32:24.7390542Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.7390893Z 2025-05-07T20:32:24.7391093Z x_sign = torch.sign(x) 2025-05-07T20:32:24.7391385Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.7391700Z x = x_sign * x_clamp 2025-05-07T20:32:24.7391951Z x0 = x[:, :D] 2025-05-07T20:32:24.7392236Z x1 = x[:, D:] 2025-05-07T20:32:24.7392451Z 2025-05-07T20:32:24.7392674Z if contiguous: 2025-05-07T20:32:24.7392914Z x0 = x0.contiguous() 2025-05-07T20:32:24.7393180Z x1 = x1.contiguous() 2025-05-07T20:32:24.7393420Z 2025-05-07T20:32:24.7393621Z if scale_ub is not None: 2025-05-07T20:32:24.7393899Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.7394296Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.7394615Z ) 2025-05-07T20:32:24.7394813Z else: 2025-05-07T20:32:24.7395024Z scale_ub_tensor = None 2025-05-07T20:32:24.7395281Z 2025-05-07T20:32:24.7395517Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.7395831Z op = silu_mul_quant 2025-05-07T20:32:24.7396088Z if compiled: 2025-05-07T20:32:24.7396342Z op = torch.compile(op) 2025-05-07T20:32:24.7396642Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.7396925Z 2025-05-07T20:32:24.7397122Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.7397286Z 2025-05-07T20:32:24.7397390Z moe/activation_test.py:117: 2025-05-07T20:32:24.7397683Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.7398020Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.7398307Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.7399076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.7399780Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.7400327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.7401011Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.7401687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.7402224Z kernel = self.compile( 2025-05-07T20:32:24.7402772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.7403429Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.7403830Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.7404070Z 2025-05-07T20:32:24.7404280Z self = 2025-05-07T20:32:24.7405366Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.7406996Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f050831d8a0>} 2025-05-07T20:32:24.7408397Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.7409421Z context = 2025-05-07T20:32:24.7409702Z 2025-05-07T20:32:24.7409869Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.7410402Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.7410859Z module_map=module_map) 2025-05-07T20:32:24.7411228Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.7411581Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.7411842Z E ^ 2025-05-07T20:32:24.7412429Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.7412891Z 2025-05-07T20:32:24.7413310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.7413824Z 2025-05-07T20:32:24.7413937Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.7414340Z self=, 2025-05-07T20:32:24.7414752Z T=128, 2025-05-07T20:32:24.7415002Z D=7168, 2025-05-07T20:32:24.7415198Z scale_ub=None, 2025-05-07T20:32:24.7415410Z contiguous=True, 2025-05-07T20:32:24.7415639Z compiled=False, 2025-05-07T20:32:24.7415840Z ) 2025-05-07T20:32:24.7416167Z self = 2025-05-07T20:32:24.7416654Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:24.7416924Z 2025-05-07T20:32:24.7417004Z @given( 2025-05-07T20:32:24.7417236Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.7417553Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.7417859Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.7418191Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.7418515Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.7418804Z ) 2025-05-07T20:32:24.7419143Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.7419661Z def test_silu_mul_quant( 2025-05-07T20:32:24.7419902Z self, 2025-05-07T20:32:24.7420096Z T: int, 2025-05-07T20:32:24.7420294Z D: int, 2025-05-07T20:32:24.7420517Z scale_ub: Optional[float], 2025-05-07T20:32:24.7420784Z contiguous: bool, 2025-05-07T20:32:24.7421032Z compiled: bool, 2025-05-07T20:32:24.7421253Z ) -> None: 2025-05-07T20:32:24.7421471Z torch.manual_seed(2025) 2025-05-07T20:32:24.7421710Z 2025-05-07T20:32:24.7421994Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.7422336Z 2025-05-07T20:32:24.7422530Z x_sign = torch.sign(x) 2025-05-07T20:32:24.7422819Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.7423139Z x = x_sign * x_clamp 2025-05-07T20:32:24.7423371Z x0 = x[:, :D] 2025-05-07T20:32:24.7423595Z x1 = x[:, D:] 2025-05-07T20:32:24.7423803Z 2025-05-07T20:32:24.7423991Z if contiguous: 2025-05-07T20:32:24.7424226Z x0 = x0.contiguous() 2025-05-07T20:32:24.7424497Z x1 = x1.contiguous() 2025-05-07T20:32:24.7424732Z 2025-05-07T20:32:24.7424938Z if scale_ub is not None: 2025-05-07T20:32:24.7425210Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.7425549Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.7425860Z ) 2025-05-07T20:32:24.7426062Z else: 2025-05-07T20:32:24.7426322Z scale_ub_tensor = None 2025-05-07T20:32:24.7426585Z 2025-05-07T20:32:24.7426819Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.7427145Z op = silu_mul_quant 2025-05-07T20:32:24.7427392Z if compiled: 2025-05-07T20:32:24.7427657Z op = torch.compile(op) 2025-05-07T20:32:24.7427952Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.7428233Z 2025-05-07T20:32:24.7428427Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.7428604Z 2025-05-07T20:32:24.7428710Z moe/activation_test.py:117: 2025-05-07T20:32:24.7429002Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.7429335Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.7429626Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.7430311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.7431017Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.7431602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.7432296Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.7432955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.7433498Z kernel = self.compile( 2025-05-07T20:32:24.7434090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.7434755Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.7435147Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.7435385Z 2025-05-07T20:32:24.7435593Z self = 2025-05-07T20:32:24.7436680Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.7438045Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f050831e7a0>} 2025-05-07T20:32:24.7439391Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.7440461Z context = 2025-05-07T20:32:24.7440758Z 2025-05-07T20:32:24.7440930Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.7441450Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.7441928Z module_map=module_map) 2025-05-07T20:32:24.7442304Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.7442711Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.7442966Z E ^ 2025-05-07T20:32:24.7443440Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.7443896Z 2025-05-07T20:32:24.7444311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.7444831Z 2025-05-07T20:32:24.7444946Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.7445356Z self=, 2025-05-07T20:32:24.7445765Z T=2048, 2025-05-07T20:32:24.7445956Z D=7168, 2025-05-07T20:32:24.7446155Z scale_ub=1200.0, 2025-05-07T20:32:24.7446380Z contiguous=True, 2025-05-07T20:32:24.7446610Z compiled=False, 2025-05-07T20:32:24.7446864Z ) 2025-05-07T20:32:24.8121128Z self = 2025-05-07T20:32:24.8121736Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:24.8124403Z 2025-05-07T20:32:24.8124659Z @given( 2025-05-07T20:32:24.8124901Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8125225Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8125537Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8125874Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8126236Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8126525Z ) 2025-05-07T20:32:24.8126877Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8127318Z def test_silu_mul_quant( 2025-05-07T20:32:24.8127611Z self, 2025-05-07T20:32:24.8127808Z T: int, 2025-05-07T20:32:24.8128008Z D: int, 2025-05-07T20:32:24.8128341Z scale_ub: Optional[float], 2025-05-07T20:32:24.8128619Z contiguous: bool, 2025-05-07T20:32:24.8128858Z compiled: bool, 2025-05-07T20:32:24.8129084Z ) -> None: 2025-05-07T20:32:24.8129304Z torch.manual_seed(2025) 2025-05-07T20:32:24.8129544Z 2025-05-07T20:32:24.8129820Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8131880Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.8133876Z 2025-05-07T20:32:24.8133998Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.8134209Z 2025-05-07T20:32:24.8134319Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8134732Z self=, 2025-05-07T20:32:24.8135138Z T=1, 2025-05-07T20:32:24.8135331Z D=5120, 2025-05-07T20:32:24.8135527Z scale_ub=1200.0, 2025-05-07T20:32:24.8135757Z contiguous=True, 2025-05-07T20:32:24.8136052Z compiled=False, 2025-05-07T20:32:24.8136252Z ) 2025-05-07T20:32:24.8136573Z self = 2025-05-07T20:32:24.8137059Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:24.8137324Z 2025-05-07T20:32:24.8137411Z @given( 2025-05-07T20:32:24.8137640Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8137954Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8138265Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8138590Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8138915Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8139202Z ) 2025-05-07T20:32:24.8139547Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8139994Z def test_silu_mul_quant( 2025-05-07T20:32:24.8140243Z self, 2025-05-07T20:32:24.8140449Z T: int, 2025-05-07T20:32:24.8140652Z D: int, 2025-05-07T20:32:24.8140874Z scale_ub: Optional[float], 2025-05-07T20:32:24.8141141Z contiguous: bool, 2025-05-07T20:32:24.8141387Z compiled: bool, 2025-05-07T20:32:24.8141614Z ) -> None: 2025-05-07T20:32:24.8141830Z torch.manual_seed(2025) 2025-05-07T20:32:24.8142072Z 2025-05-07T20:32:24.8142347Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8142691Z 2025-05-07T20:32:24.8142958Z x_sign = torch.sign(x) 2025-05-07T20:32:24.8143260Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.8143573Z x = x_sign * x_clamp 2025-05-07T20:32:24.8143814Z x0 = x[:, :D] 2025-05-07T20:32:24.8144039Z x1 = x[:, D:] 2025-05-07T20:32:24.8144247Z 2025-05-07T20:32:24.8144433Z if contiguous: 2025-05-07T20:32:24.8144671Z x0 = x0.contiguous() 2025-05-07T20:32:24.8144941Z x1 = x1.contiguous() 2025-05-07T20:32:24.8145188Z 2025-05-07T20:32:24.8145386Z if scale_ub is not None: 2025-05-07T20:32:24.8145659Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.8145993Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.8146305Z ) 2025-05-07T20:32:24.8146503Z else: 2025-05-07T20:32:24.8146714Z scale_ub_tensor = None 2025-05-07T20:32:24.8146969Z 2025-05-07T20:32:24.8147202Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.8147569Z op = silu_mul_quant 2025-05-07T20:32:24.8147820Z if compiled: 2025-05-07T20:32:24.8148067Z op = torch.compile(op) 2025-05-07T20:32:24.8148368Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.8148636Z 2025-05-07T20:32:24.8148835Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.8149000Z 2025-05-07T20:32:24.8149107Z moe/activation_test.py:117: 2025-05-07T20:32:24.8149400Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.8149780Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.8150064Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.8150754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.8151441Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.8151983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.8152669Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.8153326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.8153861Z kernel = self.compile( 2025-05-07T20:32:24.8154399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.8155101Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.8155491Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.8155721Z 2025-05-07T20:32:24.8155924Z self = 2025-05-07T20:32:24.8157003Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.8158368Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f050831fb00>} 2025-05-07T20:32:24.8159708Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.8160729Z context = 2025-05-07T20:32:24.8161018Z 2025-05-07T20:32:24.8161182Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.8161703Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.8162173Z module_map=module_map) 2025-05-07T20:32:24.8162604Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.8169966Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.8170259Z E ^ 2025-05-07T20:32:24.8170740Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.8171190Z 2025-05-07T20:32:24.8171612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.8172133Z 2025-05-07T20:32:24.8172245Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8172719Z self=, 2025-05-07T20:32:24.8173127Z T=2048, 2025-05-07T20:32:24.8173321Z D=5120, 2025-05-07T20:32:24.8173521Z scale_ub=None, 2025-05-07T20:32:24.8173738Z contiguous=True, 2025-05-07T20:32:24.8173958Z compiled=False, 2025-05-07T20:32:24.8174162Z ) 2025-05-07T20:32:24.8174488Z self = 2025-05-07T20:32:24.8175050Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:24.8175327Z 2025-05-07T20:32:24.8175408Z @given( 2025-05-07T20:32:24.8175641Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8175951Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8176263Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8176594Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8176972Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8177253Z ) 2025-05-07T20:32:24.8177605Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8178045Z def test_silu_mul_quant( 2025-05-07T20:32:24.8178282Z self, 2025-05-07T20:32:24.8178477Z T: int, 2025-05-07T20:32:24.8178677Z D: int, 2025-05-07T20:32:24.8178892Z scale_ub: Optional[float], 2025-05-07T20:32:24.8179166Z contiguous: bool, 2025-05-07T20:32:24.8179416Z compiled: bool, 2025-05-07T20:32:24.8179642Z ) -> None: 2025-05-07T20:32:24.8179853Z torch.manual_seed(2025) 2025-05-07T20:32:24.8180097Z 2025-05-07T20:32:24.8180372Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8180713Z 2025-05-07T20:32:24.8180905Z > x_sign = torch.sign(x) 2025-05-07T20:32:24.8182906Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
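Worth noting: compiled=False examples (for instance the T=1 and T=128 runs above) fail with the same CompilationError as compiled=True ones. That is consistent with the traceback, where the failure originates in triton/runtime/jit.py: _fbgemm_silu_mul_quant is a Triton JIT kernel compiled at first launch regardless of torch.compile, so the error reflects the device, not the Dynamo path.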
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.8184811Z 2025-05-07T20:32:24.8184937Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:24.8185148Z 2025-05-07T20:32:24.8185251Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8185665Z self=, 2025-05-07T20:32:24.8186072Z T=16384, 2025-05-07T20:32:24.8186271Z D=5120, 2025-05-07T20:32:24.8186459Z scale_ub=None, 2025-05-07T20:32:24.8186675Z contiguous=True, 2025-05-07T20:32:24.8186900Z compiled=False, 2025-05-07T20:32:24.8187102Z ) 2025-05-07T20:32:24.8885300Z self = 2025-05-07T20:32:24.8885827Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:24.8886147Z 2025-05-07T20:32:24.8886243Z @given( 2025-05-07T20:32:24.8886496Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8886960Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8887409Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8887824Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8888157Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8888457Z ) 2025-05-07T20:32:24.8888813Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8889256Z def test_silu_mul_quant( 2025-05-07T20:32:24.8889508Z self, 2025-05-07T20:32:24.8889711Z T: int, 2025-05-07T20:32:24.8889919Z D: int, 2025-05-07T20:32:24.8890143Z scale_ub: Optional[float], 2025-05-07T20:32:24.8890418Z contiguous: bool, 2025-05-07T20:32:24.8890661Z compiled: bool, 2025-05-07T20:32:24.8890894Z ) -> None: 2025-05-07T20:32:24.8891112Z torch.manual_seed(2025) 2025-05-07T20:32:24.8891353Z 2025-05-07T20:32:24.8891635Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8893769Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.8895701Z 2025-05-07T20:32:24.8895826Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.8896040Z 2025-05-07T20:32:24.8896152Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8896567Z self=, 2025-05-07T20:32:24.8896976Z T=4096, 2025-05-07T20:32:24.8897179Z D=5120, 2025-05-07T20:32:24.8897376Z scale_ub=None, 2025-05-07T20:32:24.8897597Z contiguous=True, 2025-05-07T20:32:24.8897837Z compiled=False, 2025-05-07T20:32:24.8898048Z ) 2025-05-07T20:32:24.8898375Z self = 2025-05-07T20:32:24.8898876Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:24.8899145Z 2025-05-07T20:32:24.8899232Z @given( 2025-05-07T20:32:24.8899464Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8899782Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8900169Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8900499Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8900831Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8901128Z ) 2025-05-07T20:32:24.8901478Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8901927Z def test_silu_mul_quant( 2025-05-07T20:32:24.8902174Z self, 2025-05-07T20:32:24.8902383Z T: int, 2025-05-07T20:32:24.8902592Z D: int, 2025-05-07T20:32:24.8902826Z scale_ub: Optional[float], 2025-05-07T20:32:24.8903115Z contiguous: bool, 2025-05-07T20:32:24.8903364Z compiled: bool, 2025-05-07T20:32:24.8903596Z ) -> None: 2025-05-07T20:32:24.8903821Z torch.manual_seed(2025) 2025-05-07T20:32:24.8904069Z 2025-05-07T20:32:24.8904349Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8906640Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
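On a device this close to full, the property-based test could also skip example sizes that cannot fit rather than erroring out. A sketch using Hypothesis's assume() together with torch.cuda.mem_get_info(); the 4x headroom factor is a guess meant to cover x plus the handful of same-sized temporaries (abs, clamp, product) the test creates:

    import torch
    from hypothesis import assume

    # Inside test_silu_mul_quant, before the first allocation:
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    needed_bytes = T * (2 * D) * 2  # bf16 input x is [T, 2*D], 2 bytes per element
    assume(4 * needed_bytes <= free_bytes)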
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.8908505Z 2025-05-07T20:32:24.8908633Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.8908848Z 2025-05-07T20:32:24.8908959Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8909380Z self=, 2025-05-07T20:32:24.8909793Z T=2048, 2025-05-07T20:32:24.8909988Z D=5120, 2025-05-07T20:32:24.8910184Z scale_ub=None, 2025-05-07T20:32:24.8910417Z contiguous=False, 2025-05-07T20:32:24.8910651Z compiled=False, 2025-05-07T20:32:24.8910859Z ) 2025-05-07T20:32:24.8911206Z self = 2025-05-07T20:32:24.8911704Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:24.8911977Z 2025-05-07T20:32:24.8912062Z @given( 2025-05-07T20:32:24.8912303Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8912628Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8913005Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8913342Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8913676Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8913971Z ) 2025-05-07T20:32:24.8914325Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8914776Z def test_silu_mul_quant( 2025-05-07T20:32:24.8915094Z self, 2025-05-07T20:32:24.8915303Z T: int, 2025-05-07T20:32:24.8915504Z D: int, 2025-05-07T20:32:24.8915728Z scale_ub: Optional[float], 2025-05-07T20:32:24.8916010Z contiguous: bool, 2025-05-07T20:32:24.8916257Z compiled: bool, 2025-05-07T20:32:24.8916487Z ) -> None: 2025-05-07T20:32:24.8916712Z torch.manual_seed(2025) 2025-05-07T20:32:24.8916954Z 2025-05-07T20:32:24.8917233Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8919310Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.8921245Z 2025-05-07T20:32:24.8921368Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.8921589Z 2025-05-07T20:32:24.8921695Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8922119Z self=, 2025-05-07T20:32:24.8922525Z T=4096, 2025-05-07T20:32:24.8922718Z D=7168, 2025-05-07T20:32:24.8922918Z scale_ub=None, 2025-05-07T20:32:24.8923142Z contiguous=True, 2025-05-07T20:32:24.8923371Z compiled=True, 2025-05-07T20:32:24.8923582Z ) 2025-05-07T20:32:24.8923891Z self = 2025-05-07T20:32:24.8924379Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:24.8924645Z 2025-05-07T20:32:24.8924723Z @given( 2025-05-07T20:32:24.8924951Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8925262Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8925566Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8925892Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8926217Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8926506Z ) 2025-05-07T20:32:24.8926854Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8927289Z def test_silu_mul_quant( 2025-05-07T20:32:24.8927656Z self, 2025-05-07T20:32:24.8927855Z T: int, 2025-05-07T20:32:24.8928042Z D: int, 2025-05-07T20:32:24.8928256Z scale_ub: Optional[float], 2025-05-07T20:32:24.8928520Z contiguous: bool, 2025-05-07T20:32:24.8928755Z compiled: bool, 2025-05-07T20:32:24.8928968Z ) -> None: 2025-05-07T20:32:24.8929175Z torch.manual_seed(2025) 2025-05-07T20:32:24.8929408Z 2025-05-07T20:32:24.8929667Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8931765Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.8933628Z 2025-05-07T20:32:24.8933748Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.8933962Z 2025-05-07T20:32:24.8934073Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8934486Z self=, 2025-05-07T20:32:24.8934890Z T=2048, 2025-05-07T20:32:24.8935126Z D=5120, 2025-05-07T20:32:24.8935323Z scale_ub=1200.0, 2025-05-07T20:32:24.8935549Z contiguous=False, 2025-05-07T20:32:24.8935773Z compiled=False, 2025-05-07T20:32:24.8935980Z ) 2025-05-07T20:32:24.8936285Z self = 2025-05-07T20:32:24.8936782Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:24.8937054Z 2025-05-07T20:32:24.8937134Z @given( 2025-05-07T20:32:24.8937364Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8937675Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8937981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8938302Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8938634Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8938914Z ) 2025-05-07T20:32:24.8939264Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8939771Z def test_silu_mul_quant( 2025-05-07T20:32:24.8940020Z self, 2025-05-07T20:32:24.8940209Z T: int, 2025-05-07T20:32:24.8940404Z D: int, 2025-05-07T20:32:24.8940620Z scale_ub: Optional[float], 2025-05-07T20:32:24.8940893Z contiguous: bool, 2025-05-07T20:32:24.8941120Z compiled: bool, 2025-05-07T20:32:24.8941343Z ) -> None: 2025-05-07T20:32:24.8941556Z torch.manual_seed(2025) 2025-05-07T20:32:24.8941798Z 2025-05-07T20:32:24.8942074Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8944118Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.8945979Z 2025-05-07T20:32:24.8946095Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.8946308Z 2025-05-07T20:32:24.8946421Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8946824Z self=, 2025-05-07T20:32:24.8947228Z T=4096, 2025-05-07T20:32:24.8947459Z D=7168, 2025-05-07T20:32:24.8947654Z scale_ub=1200.0, 2025-05-07T20:32:24.8947882Z contiguous=True, 2025-05-07T20:32:24.8948104Z compiled=False, 2025-05-07T20:32:24.8948312Z ) 2025-05-07T20:32:24.9864108Z self = 2025-05-07T20:32:24.9865079Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:24.9865597Z 2025-05-07T20:32:24.9865765Z @given( 2025-05-07T20:32:24.9866188Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.9866754Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.9867319Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.9867927Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.9868527Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.9869042Z ) 2025-05-07T20:32:24.9869690Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.9870676Z def test_silu_mul_quant( 2025-05-07T20:32:24.9871118Z self, 2025-05-07T20:32:24.9871476Z T: int, 2025-05-07T20:32:24.9871842Z D: int, 2025-05-07T20:32:24.9872235Z scale_ub: Optional[float], 2025-05-07T20:32:24.9872678Z contiguous: bool, 2025-05-07T20:32:24.9872930Z compiled: bool, 2025-05-07T20:32:24.9873155Z ) -> None: 2025-05-07T20:32:24.9873374Z torch.manual_seed(2025) 2025-05-07T20:32:24.9873690Z 2025-05-07T20:32:24.9873960Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.9876033Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.9877902Z 2025-05-07T20:32:24.9878026Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.9878244Z 2025-05-07T20:32:24.9878347Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.9878765Z self=, 2025-05-07T20:32:24.9879239Z T=16384, 2025-05-07T20:32:24.9879439Z D=7168, 2025-05-07T20:32:24.9879636Z scale_ub=None, 2025-05-07T20:32:24.9879851Z contiguous=False, 2025-05-07T20:32:24.9880077Z compiled=True, 2025-05-07T20:32:24.9880281Z ) 2025-05-07T20:32:24.9880595Z self = 2025-05-07T20:32:24.9881091Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:24.9881372Z 2025-05-07T20:32:24.9881456Z @given( 2025-05-07T20:32:24.9881692Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.9881999Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.9882306Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.9882638Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.9882964Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.9883253Z ) 2025-05-07T20:32:24.9883608Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.9884048Z def test_silu_mul_quant( 2025-05-07T20:32:24.9884320Z self, 2025-05-07T20:32:24.9884514Z T: int, 2025-05-07T20:32:24.9884716Z D: int, 2025-05-07T20:32:24.9884932Z scale_ub: Optional[float], 2025-05-07T20:32:24.9885206Z contiguous: bool, 2025-05-07T20:32:24.9885448Z compiled: bool, 2025-05-07T20:32:24.9885669Z ) -> None: 2025-05-07T20:32:24.9885959Z torch.manual_seed(2025) 2025-05-07T20:32:24.9886205Z 2025-05-07T20:32:24.9886479Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.9888597Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.9890462Z 2025-05-07T20:32:24.9890581Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.9890795Z 2025-05-07T20:32:24.9890902Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.9891318Z self=, 2025-05-07T20:32:24.9891766Z T=4096, 2025-05-07T20:32:24.9891954Z D=7168, 2025-05-07T20:32:24.9892146Z scale_ub=None, 2025-05-07T20:32:24.9892360Z contiguous=True, 2025-05-07T20:32:24.9892581Z compiled=False, 2025-05-07T20:32:24.9892797Z ) 2025-05-07T20:32:24.9893123Z self = 2025-05-07T20:32:24.9893618Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:24.9893941Z 2025-05-07T20:32:24.9894023Z @given( 2025-05-07T20:32:24.9894253Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.9894562Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.9894865Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.9895194Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.9895528Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.9895811Z ) 2025-05-07T20:32:24.9896163Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.9896609Z def test_silu_mul_quant( 2025-05-07T20:32:24.9896849Z self, 2025-05-07T20:32:24.9897046Z T: int, 2025-05-07T20:32:24.9897245Z D: int, 2025-05-07T20:32:24.9897455Z scale_ub: Optional[float], 2025-05-07T20:32:24.9897726Z contiguous: bool, 2025-05-07T20:32:24.9897963Z compiled: bool, 2025-05-07T20:32:24.9898234Z ) -> None: 2025-05-07T20:32:24.9898452Z torch.manual_seed(2025) 2025-05-07T20:32:24.9898691Z 2025-05-07T20:32:24.9898960Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.9900998Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.9902906Z 2025-05-07T20:32:24.9903027Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.9903243Z 2025-05-07T20:32:24.9903351Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.9903765Z self=, 2025-05-07T20:32:24.9904162Z T=16384, 2025-05-07T20:32:24.9904362Z D=7168, 2025-05-07T20:32:24.9904559Z scale_ub=None, 2025-05-07T20:32:24.9904766Z contiguous=True, 2025-05-07T20:32:24.9905001Z compiled=False, 2025-05-07T20:32:24.9905212Z ) 2025-05-07T20:32:24.9905525Z self = 2025-05-07T20:32:24.9906257Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:24.9906544Z 2025-05-07T20:32:24.9906625Z @given( 2025-05-07T20:32:24.9906857Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.9907166Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.9907470Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.9907799Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.9908126Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.9908424Z ) 2025-05-07T20:32:24.9908778Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.9909215Z def test_silu_mul_quant( 2025-05-07T20:32:24.9909457Z self, 2025-05-07T20:32:24.9909653Z T: int, 2025-05-07T20:32:24.9909843Z D: int, 2025-05-07T20:32:24.9910064Z scale_ub: Optional[float], 2025-05-07T20:32:24.9910339Z contiguous: bool, 2025-05-07T20:32:24.9910581Z compiled: bool, 2025-05-07T20:32:24.9910802Z ) -> None: 2025-05-07T20:32:24.9911086Z torch.manual_seed(2025) 2025-05-07T20:32:24.9911329Z 2025-05-07T20:32:24.9911595Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.9913640Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
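The "Tried to allocate" figures line up exactly with the test's first tensor, x = torch.randn([T, 2 * D], dtype=torch.bfloat16), at 2 bytes per element — a quick check against the sizes reported above:

def bf16_input_mib(T: int, D: int) -> float:
    # [T, 2 * D] bfloat16 elements, 2 bytes each, reported in MiB.
    return T * (2 * D) * 2 / 2**20

assert bf16_input_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"
assert bf16_input_mib(4096, 7168) == 112.0   # "Tried to allocate 112.00 MiB"
assert bf16_input_mib(2048, 7168) == 56.0    # "Tried to allocate 56.00 MiB"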
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.9915553Z 2025-05-07T20:32:24.9915672Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.9915882Z 2025-05-07T20:32:24.9915990Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.9916407Z self=, 2025-05-07T20:32:24.9916804Z T=16384, 2025-05-07T20:32:24.9917000Z D=7168, 2025-05-07T20:32:24.9917199Z scale_ub=1200.0, 2025-05-07T20:32:24.9917420Z contiguous=True, 2025-05-07T20:32:24.9917645Z compiled=False, 2025-05-07T20:32:24.9917850Z ) 2025-05-07T20:32:24.9918165Z self = 2025-05-07T20:32:24.9918731Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:24.9919007Z 2025-05-07T20:32:24.9919090Z @given( 2025-05-07T20:32:24.9919315Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.9919630Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.9919935Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.9920262Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.9920589Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.9920880Z ) 2025-05-07T20:32:24.9921229Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.9921666Z def test_silu_mul_quant( 2025-05-07T20:32:24.9921908Z self, 2025-05-07T20:32:24.9922108Z T: int, 2025-05-07T20:32:24.9922299Z D: int, 2025-05-07T20:32:24.9922524Z scale_ub: Optional[float], 2025-05-07T20:32:24.9922794Z contiguous: bool, 2025-05-07T20:32:24.9923036Z compiled: bool, 2025-05-07T20:32:24.9923259Z ) -> None: 2025-05-07T20:32:24.9923479Z torch.manual_seed(2025) 2025-05-07T20:32:24.9923722Z 2025-05-07T20:32:24.9923998Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.9926094Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.9928006Z 2025-05-07T20:32:24.9928126Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.9928341Z 2025-05-07T20:32:24.9928449Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.9928860Z self=, 2025-05-07T20:32:24.9929260Z T=128, 2025-05-07T20:32:24.9929450Z D=5120, 2025-05-07T20:32:24.9929641Z scale_ub=1200.0, 2025-05-07T20:32:24.9929869Z contiguous=False, 2025-05-07T20:32:24.9930097Z compiled=False, 2025-05-07T20:32:24.9930305Z ) 2025-05-07T20:32:25.0940809Z self = 2025-05-07T20:32:25.0942486Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:25.0943043Z 2025-05-07T20:32:25.0943171Z @given( 2025-05-07T20:32:25.0943534Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.0944038Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.0944529Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.0957977Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.0958719Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.0959135Z ) 2025-05-07T20:32:25.0959648Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.0960306Z def test_silu_mul_quant( 2025-05-07T20:32:25.0960661Z self, 2025-05-07T20:32:25.0960957Z T: int, 2025-05-07T20:32:25.0961254Z D: int, 2025-05-07T20:32:25.0961586Z scale_ub: Optional[float], 2025-05-07T20:32:25.0962032Z contiguous: bool, 2025-05-07T20:32:25.0962419Z compiled: bool, 2025-05-07T20:32:25.0962792Z ) -> None: 2025-05-07T20:32:25.0963106Z torch.manual_seed(2025) 2025-05-07T20:32:25.0963472Z 2025-05-07T20:32:25.0963872Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.0964445Z 2025-05-07T20:32:25.0964751Z x_sign = torch.sign(x) 2025-05-07T20:32:25.0965223Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.0965926Z x = x_sign * x_clamp 2025-05-07T20:32:25.0966339Z x0 = x[:, :D] 2025-05-07T20:32:25.0966694Z x1 = x[:, D:] 2025-05-07T20:32:25.0967055Z 2025-05-07T20:32:25.0967371Z if contiguous: 2025-05-07T20:32:25.0967906Z x0 = x0.contiguous() 2025-05-07T20:32:25.0968358Z x1 = x1.contiguous() 2025-05-07T20:32:25.0968777Z 2025-05-07T20:32:25.0969090Z if scale_ub is not None: 2025-05-07T20:32:25.0969567Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.0970158Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.0970696Z ) 2025-05-07T20:32:25.0971012Z else: 2025-05-07T20:32:25.0971371Z scale_ub_tensor = None 2025-05-07T20:32:25.0971782Z 2025-05-07T20:32:25.0972131Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.0972651Z op = silu_mul_quant 2025-05-07T20:32:25.0973081Z if compiled: 2025-05-07T20:32:25.0973469Z op = torch.compile(op) 2025-05-07T20:32:25.0973942Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.0974399Z 2025-05-07T20:32:25.0974702Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.0975000Z 2025-05-07T20:32:25.0975171Z moe/activation_test.py:117: 2025-05-07T20:32:25.0975680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.0976242Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.0976863Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.0978112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.0979355Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.0980286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.0981526Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.0982722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.0983680Z kernel = self.compile( 2025-05-07T20:32:25.0984581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.0985656Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.0986333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.0986830Z 2025-05-07T20:32:25.0987164Z self = 2025-05-07T20:32:25.0988932Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.0991343Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f005df3e700>} 2025-05-07T20:32:25.0993885Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.0995719Z context = 2025-05-07T20:32:25.0996202Z 2025-05-07T20:32:25.0996491Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.0997414Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.0998238Z module_map=module_map) 2025-05-07T20:32:25.0998843Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.0999440Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.0999881Z E ^ 2025-05-07T20:32:25.1000774Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.1001543Z 2025-05-07T20:32:25.1002262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.1003175Z 2025-05-07T20:32:25.1003345Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1004048Z self=, 2025-05-07T20:32:25.1004758Z T=2048, 2025-05-07T20:32:25.1005077Z D=7168, 2025-05-07T20:32:25.1005402Z scale_ub=None, 2025-05-07T20:32:25.1006367Z contiguous=False, 2025-05-07T20:32:25.1006750Z compiled=False, 2025-05-07T20:32:25.1007102Z ) 2025-05-07T20:32:25.1007711Z self = 2025-05-07T20:32:25.1008528Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:25.1009045Z 2025-05-07T20:32:25.1009170Z @given( 2025-05-07T20:32:25.1009526Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1009979Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1010438Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1010990Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1011569Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1012056Z ) 2025-05-07T20:32:25.1012817Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1013608Z def test_silu_mul_quant( 2025-05-07T20:32:25.1014011Z self, 2025-05-07T20:32:25.1014332Z T: int, 2025-05-07T20:32:25.1014657Z D: int, 2025-05-07T20:32:25.1015008Z scale_ub: Optional[float], 2025-05-07T20:32:25.1015468Z contiguous: bool, 2025-05-07T20:32:25.1015877Z compiled: bool, 2025-05-07T20:32:25.1016238Z ) -> None: 2025-05-07T20:32:25.1016605Z torch.manual_seed(2025) 2025-05-07T20:32:25.1017016Z 2025-05-07T20:32:25.1017466Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1021313Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
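Two failure modes alternate in this log: the CUDA OOMs and the Triton CompilationError ("type fp8e4nv not supported in this architecture"). For the latter, a hedged guard that a test could use to skip FP8 paths on unsupported devices — the sm_89 threshold is an assumption inferred from the error text, not FBGEMM's own gating:

import torch

def supports_triton_fp8e4nv() -> bool:
    # Assumption: Triton's fp8e4nv (e4m3) needs compute capability >= 8.9
    # (Ada/Hopper); on older parts only fp8e4b15/fp8e5 compile, per the error.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

A test might then wrap the quantized paths in unittest.skipIf(not supports_triton_fp8e4nv(), "fp8e4nv unsupported").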
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.1024709Z 2025-05-07T20:32:25.1024904Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.1025268Z 2025-05-07T20:32:25.1025433Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1026122Z self=, 2025-05-07T20:32:25.1026888Z T=128, 2025-05-07T20:32:25.1027194Z D=7168, 2025-05-07T20:32:25.1027516Z scale_ub=1200.0, 2025-05-07T20:32:25.1027879Z contiguous=True, 2025-05-07T20:32:25.1028259Z compiled=True, 2025-05-07T20:32:25.1028605Z ) 2025-05-07T20:32:25.1320578Z self = 2025-05-07T20:32:25.1321469Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.1321937Z 2025-05-07T20:32:25.1322077Z @given( 2025-05-07T20:32:25.1322435Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1322943Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1323444Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1323987Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1324526Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1325002Z ) 2025-05-07T20:32:25.1325783Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1326528Z def test_silu_mul_quant( 2025-05-07T20:32:25.1326889Z self, 2025-05-07T20:32:25.1327159Z T: int, 2025-05-07T20:32:25.1327456Z D: int, 2025-05-07T20:32:25.1327910Z scale_ub: Optional[float], 2025-05-07T20:32:25.1328356Z contiguous: bool, 2025-05-07T20:32:25.1328759Z compiled: bool, 2025-05-07T20:32:25.1329122Z ) -> None: 2025-05-07T20:32:25.1329503Z torch.manual_seed(2025) 2025-05-07T20:32:25.1329923Z 2025-05-07T20:32:25.1330333Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1330860Z 2025-05-07T20:32:25.1331143Z x_sign = torch.sign(x) 2025-05-07T20:32:25.1331558Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.1332033Z x = x_sign * x_clamp 2025-05-07T20:32:25.1332428Z x0 = x[:, :D] 2025-05-07T20:32:25.1332784Z x1 = x[:, D:] 2025-05-07T20:32:25.1333084Z 2025-05-07T20:32:25.1333356Z if contiguous: 2025-05-07T20:32:25.1333707Z x0 = x0.contiguous() 2025-05-07T20:32:25.1334127Z x1 = x1.contiguous() 2025-05-07T20:32:25.1334515Z 2025-05-07T20:32:25.1334828Z if scale_ub is not None: 2025-05-07T20:32:25.1335211Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.1335680Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.1336257Z ) 2025-05-07T20:32:25.1336534Z else: 2025-05-07T20:32:25.1336831Z scale_ub_tensor = None 2025-05-07T20:32:25.1337193Z 2025-05-07T20:32:25.1337518Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1337982Z op = silu_mul_quant 2025-05-07T20:32:25.1338363Z if compiled: 2025-05-07T20:32:25.1338719Z op = torch.compile(op) 2025-05-07T20:32:25.1339171Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1339613Z 2025-05-07T20:32:25.1339903Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.1340147Z 2025-05-07T20:32:25.1340295Z moe/activation_test.py:117: 2025-05-07T20:32:25.1340731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1341216Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.1341613Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1342480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.1343441Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.1344427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.1345463Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.1346271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.1347409Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.1348408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.1349221Z kernel = self.compile( 2025-05-07T20:32:25.1350039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.1351037Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.1351633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1351989Z 2025-05-07T20:32:25.1352283Z self = 2025-05-07T20:32:25.1353951Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.1356199Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f005df3ff60>} 2025-05-07T20:32:25.1358300Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.1359835Z context = 2025-05-07T20:32:25.1360262Z 2025-05-07T20:32:25.1360505Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.1361302Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.1362030Z module_map=module_map) 2025-05-07T20:32:25.1362552Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.1363052Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.1363481Z E ^ 2025-05-07T20:32:25.1364155Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.1364869Z 2025-05-07T20:32:25.1365524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.1366338Z 2025-05-07T20:32:25.1366492Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1367180Z self=, 2025-05-07T20:32:25.1367915Z T=128, 2025-05-07T20:32:25.1368208Z D=7168, 2025-05-07T20:32:25.1368514Z scale_ub=1200.0, 2025-05-07T20:32:25.1368857Z contiguous=True, 2025-05-07T20:32:25.1369214Z compiled=False, 2025-05-07T20:32:25.1369529Z ) 2025-05-07T20:32:25.1369973Z self = 2025-05-07T20:32:25.1370707Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.1371131Z 2025-05-07T20:32:25.1371264Z @given( 2025-05-07T20:32:25.1371620Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1372127Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1372632Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1373106Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1373602Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1374042Z ) 2025-05-07T20:32:25.1374653Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1375360Z def test_silu_mul_quant( 2025-05-07T20:32:25.1375707Z self, 2025-05-07T20:32:25.1375983Z T: int, 2025-05-07T20:32:25.1376263Z D: int, 2025-05-07T20:32:25.1376584Z scale_ub: Optional[float], 2025-05-07T20:32:25.1376994Z contiguous: bool, 2025-05-07T20:32:25.1377368Z compiled: bool, 2025-05-07T20:32:25.1377801Z ) -> None: 2025-05-07T20:32:25.1378163Z torch.manual_seed(2025) 2025-05-07T20:32:25.1378529Z 2025-05-07T20:32:25.1378932Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1379429Z 2025-05-07T20:32:25.1379705Z x_sign = torch.sign(x) 2025-05-07T20:32:25.1380140Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.1383345Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.1386291Z 2025-05-07T20:32:25.1386477Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:25.1386802Z 2025-05-07T20:32:25.1386965Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1387558Z self=, 2025-05-07T20:32:25.1388144Z T=128, 2025-05-07T20:32:25.1388412Z D=5120, 2025-05-07T20:32:25.1388685Z scale_ub=1200.0, 2025-05-07T20:32:25.1389001Z contiguous=True, 2025-05-07T20:32:25.1389321Z compiled=True, 2025-05-07T20:32:25.1389612Z ) 2025-05-07T20:32:25.1390068Z self = 2025-05-07T20:32:25.1390762Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.1391146Z 2025-05-07T20:32:25.1391268Z @given( 2025-05-07T20:32:25.1391596Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1392059Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1392522Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1393006Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1393499Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1393916Z ) 2025-05-07T20:32:25.1394415Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1395043Z def test_silu_mul_quant( 2025-05-07T20:32:25.1395383Z self, 2025-05-07T20:32:25.1395650Z T: int, 2025-05-07T20:32:25.1395997Z D: int, 2025-05-07T20:32:25.1396307Z scale_ub: Optional[float], 2025-05-07T20:32:25.1396692Z contiguous: bool, 2025-05-07T20:32:25.1397023Z compiled: bool, 2025-05-07T20:32:25.1397338Z ) -> None: 2025-05-07T20:32:25.1397644Z torch.manual_seed(2025) 2025-05-07T20:32:25.1397984Z 2025-05-07T20:32:25.1398367Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1398860Z 2025-05-07T20:32:25.1399136Z x_sign = torch.sign(x) 2025-05-07T20:32:25.1399545Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.1402526Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.1405253Z 2025-05-07T20:32:25.1405437Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:25.1406287Z 2025-05-07T20:32:25.1406451Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1407051Z self=, 2025-05-07T20:32:25.1407870Z T=128, 2025-05-07T20:32:25.1408142Z D=7168, 2025-05-07T20:32:25.1408418Z scale_ub=None, 2025-05-07T20:32:25.1408732Z contiguous=True, 2025-05-07T20:32:25.1409058Z compiled=True, 2025-05-07T20:32:25.1409349Z ) 2025-05-07T20:32:25.3348348Z self = 2025-05-07T20:32:25.3349410Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.3349952Z 2025-05-07T20:32:25.3350115Z @given( 2025-05-07T20:32:25.3350610Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3351241Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.3351858Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.3352444Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.3352822Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.3353108Z ) 2025-05-07T20:32:25.3353457Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.3354067Z def test_silu_mul_quant( 2025-05-07T20:32:25.3354309Z self, 2025-05-07T20:32:25.3354503Z T: int, 2025-05-07T20:32:25.3354704Z D: int, 2025-05-07T20:32:25.3354923Z scale_ub: Optional[float], 2025-05-07T20:32:25.3355191Z contiguous: bool, 2025-05-07T20:32:25.3355433Z compiled: bool, 2025-05-07T20:32:25.3355661Z ) -> None: 2025-05-07T20:32:25.3355875Z torch.manual_seed(2025) 2025-05-07T20:32:25.3356121Z 2025-05-07T20:32:25.3356397Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3358458Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
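Note the free-memory figure shrinking from 26.44 MiB to 4.44 MiB as Hypothesis replays examples: tensors from earlier iterations of the test body are evidently still alive. A hedged cleanup sketch — explicit release at the end of the body, since tearDown runs once per test method, not once per Hypothesis example:

import gc
import torch

def allocate_and_release(T: int, D: int) -> int:
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    del x                      # drop the only reference to the buffer
    gc.collect()               # collect anything still holding CUDA storage
    torch.cuda.empty_cache()   # return cached blocks to the driver
    return torch.cuda.memory_reserved()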
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.3360331Z 2025-05-07T20:32:25.3360460Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.3360675Z 2025-05-07T20:32:25.3367693Z FAILED 2025-05-07T20:32:25.3367833Z 2025-05-07T20:32:25.3368103Z =================================== FAILURES =================================== 2025-05-07T20:32:25.3368564Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:25.3369202Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:25.3370127Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 57, in testPartExecutor 2025-05-07T20:32:25.3370922Z | yield 2025-05-07T20:32:25.3371510Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 623, in run 2025-05-07T20:32:25.3372228Z | self._callTestMethod(testMethod) 2025-05-07T20:32:25.3372978Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 579, in _callTestMethod 2025-05-07T20:32:25.3373730Z | if method() is not None: 2025-05-07T20:32:25.3374082Z | ^^^^^^^^ 2025-05-07T20:32:25.3374940Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:25.3376026Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3376438Z | ^^^^^^^ 2025-05-07T20:32:25.3377194Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:25.3378032Z | raise the_error_hypothesis_found 2025-05-07T20:32:25.3378631Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:25.3379286Z +-+---------------- 1 ---------------- 2025-05-07T20:32:25.3379675Z | Traceback (most recent call last): 2025-05-07T20:32:25.3380635Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:25.3381685Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3382192Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.3384975Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.3387770Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:25.3388362Z | self=, 2025-05-07T20:32:25.3388907Z | T=2048, 2025-05-07T20:32:25.3389222Z | D=5120, # or any other generated value 2025-05-07T20:32:25.3389681Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:25.3390183Z | contiguous=True, # or any other generated value 2025-05-07T20:32:25.3390668Z | compiled=False, # or any other generated value 2025-05-07T20:32:25.3391084Z | ) 2025-05-07T20:32:25.3391341Z | 2025-05-07T20:32:25.3392044Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:25.3392905Z +---------------- 2 ---------------- 2025-05-07T20:32:25.3393328Z | Traceback (most recent call last): 2025-05-07T20:32:25.3394285Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:25.3395333Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3395841Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.3398610Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.3401320Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:25.3401918Z | self=, 2025-05-07T20:32:25.3402486Z | T=128, 2025-05-07T20:32:25.3402773Z | D=7168, 2025-05-07T20:32:25.3403079Z | scale_ub=None, 2025-05-07T20:32:25.3403393Z | contiguous=True, 2025-05-07T20:32:25.3403725Z | compiled=True, 2025-05-07T20:32:25.3422148Z | ) 2025-05-07T20:32:25.3422455Z | 2025-05-07T20:32:25.3423371Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:25.3424221Z +---------------- 3 ---------------- 2025-05-07T20:32:25.3424629Z | Traceback (most recent call last): 2025-05-07T20:32:25.3425631Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:25.3426831Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3427374Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.3430173Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
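The reproduce_failure note above translates to a temporary decorator on the test case; a sketch using the blob from failure 1 — the decorator only replays correctly against the unmodified test, so the strategies must stay exactly as shown in this log:

from typing import Optional

from hypothesis import given, reproduce_failure, settings, strategies as st

@reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")  # temporary, per the report
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@settings(deadline=None)
def test_silu_mul_quant_repro(
    T: int, D: int, scale_ub: Optional[float], contiguous: bool, compiled: bool
) -> None:
    ...  # original test body goes here, unchanged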
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.3432944Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:25.3433547Z | self=, 2025-05-07T20:32:25.3434184Z | T=128, 2025-05-07T20:32:25.3434468Z | D=5120, 2025-05-07T20:32:25.3434775Z | scale_ub=1200.0, 2025-05-07T20:32:25.3435104Z | contiguous=True, 2025-05-07T20:32:25.3435445Z | compiled=True, 2025-05-07T20:32:25.3435764Z | ) 2025-05-07T20:32:25.3436011Z | 2025-05-07T20:32:25.3436742Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:25.3437585Z +---------------- 4 ---------------- 2025-05-07T20:32:25.3437986Z | Traceback (most recent call last): 2025-05-07T20:32:25.3438963Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:25.3439956Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:25.3440359Z | ^^^^^^^^ 2025-05-07T20:32:25.3441257Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:25.3442206Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.3442677Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.3443787Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:25.3444976Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:25.3445768Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:25.3446774Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.3447376Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.3448408Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:25.3449468Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.3450124Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.3451049Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:25.3452222Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.3452867Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.3453772Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:25.3454759Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:25.3455329Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.3456144Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:25.3456930Z | fn() 2025-05-07T20:32:25.3457722Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:25.3458590Z | self.fn.run( 2025-05-07T20:32:25.3459325Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:25.3460130Z | kernel = self.compile( 2025-05-07T20:32:25.3460493Z | ^^^^^^^^^^^^^ 2025-05-07T20:32:25.3461306Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:25.3462364Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.3462935Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.3463821Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:25.3464907Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.3465580Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.3466109Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.3466588Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:25.3466955Z | ^ 2025-05-07T20:32:25.3467598Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.3468365Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:25.3468893Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:25.3469605Z | self=, 2025-05-07T20:32:25.3470198Z | T=1, # or any other generated value 2025-05-07T20:32:25.3470601Z | D=5120, # or any other generated value 2025-05-07T20:32:25.3471060Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:25.3471618Z | contiguous=True, # or any other generated value 2025-05-07T20:32:25.3472121Z | compiled=True, # or any other generated value 2025-05-07T20:32:25.3472552Z | ) 2025-05-07T20:32:25.3472811Z | 2025-05-07T20:32:25.3473520Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:25.3474346Z +------------------------------------ 2025-05-07T20:32:25.3474819Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:25.3475340Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.3475915Z self=, 2025-05-07T20:32:25.3476476Z T=1, 2025-05-07T20:32:25.3476735Z D=5120, 2025-05-07T20:32:25.3476991Z scale_ub=None, 2025-05-07T20:32:25.3477277Z contiguous=True, 2025-05-07T20:32:25.3477584Z compiled=True, 2025-05-07T20:32:25.3477861Z ) 2025-05-07T20:32:25.3478312Z self = 2025-05-07T20:32:25.3479037Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.3479367Z 2025-05-07T20:32:25.3479482Z @given( 2025-05-07T20:32:25.3479788Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3480210Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.3480626Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.3481105Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.3481549Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.3481932Z ) 2025-05-07T20:32:25.3482397Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.3482987Z def test_silu_mul_quant( 2025-05-07T20:32:25.3483315Z self, 2025-05-07T20:32:25.3483574Z T: int, 2025-05-07T20:32:25.3483849Z D: int, 2025-05-07T20:32:25.3484145Z scale_ub: Optional[float], 2025-05-07T20:32:25.3484512Z contiguous: 
bool, 2025-05-07T20:32:25.3484827Z compiled: bool, 2025-05-07T20:32:25.3485127Z ) -> None: 2025-05-07T20:32:25.3485415Z torch.manual_seed(2025) 2025-05-07T20:32:25.3485734Z 2025-05-07T20:32:25.3486098Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3486561Z 2025-05-07T20:32:25.3486814Z x_sign = torch.sign(x) 2025-05-07T20:32:25.3487205Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.3487792Z x = x_sign * x_clamp 2025-05-07T20:32:25.3488114Z x0 = x[:, :D] 2025-05-07T20:32:25.3488418Z x1 = x[:, D:] 2025-05-07T20:32:25.3488713Z 2025-05-07T20:32:25.3488972Z if contiguous: 2025-05-07T20:32:25.3489297Z x0 = x0.contiguous() 2025-05-07T20:32:25.3489657Z x1 = x1.contiguous() 2025-05-07T20:32:25.3489983Z 2025-05-07T20:32:25.3490250Z if scale_ub is not None: 2025-05-07T20:32:25.3490637Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.3491086Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.3491496Z ) 2025-05-07T20:32:25.3491763Z else: 2025-05-07T20:32:25.3492060Z scale_ub_tensor = None 2025-05-07T20:32:25.3492416Z 2025-05-07T20:32:25.3492746Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3493192Z op = silu_mul_quant 2025-05-07T20:32:25.3493543Z if compiled: 2025-05-07T20:32:25.3493893Z op = torch.compile(op) 2025-05-07T20:32:25.3494298Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3494678Z 2025-05-07T20:32:25.3494928Z y_fp8, y_scale = fn() 2025-05-07T20:32:25.3495308Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:25.3495714Z 2025-05-07T20:32:25.3496031Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3496499Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:25.3496962Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:25.3497388Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:25.3497883Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.3498322Z 2025-05-07T20:32:25.3498594Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:25.3498875Z 2025-05-07T20:32:25.3499011Z moe/activation_test.py:126: 2025-05-07T20:32:25.3499403Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3499854Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:25.3500279Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.3501314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:25.3502308Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:25.3503079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.3503969Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.3504868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:25.3506084Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.3507193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:25.3508232Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.3509263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:25.3510186Z return 
self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:25.3511041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:25.3511765Z fn() 2025-05-07T20:32:25.3512506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:25.3513310Z self.fn.run( 2025-05-07T20:32:25.3513929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.3514646Z kernel = self.compile( 2025-05-07T20:32:25.3515546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.3516458Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.3517001Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3517337Z 2025-05-07T20:32:25.3517617Z self = 2025-05-07T20:32:25.3519125Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.3521035Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05c48553a0>} 2025-05-07T20:32:25.3522917Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.3524370Z context = 2025-05-07T20:32:25.3524785Z 2025-05-07T20:32:25.3525016Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.3525832Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.3526488Z module_map=module_map) 2025-05-07T20:32:25.3526993Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.3527494Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:25.3527968Z E ^ 2025-05-07T20:32:25.3528626Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.3529273Z 2025-05-07T20:32:25.3529859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.3530580Z 2025-05-07T20:32:25.3530728Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.3531291Z self=, 2025-05-07T20:32:25.3531858Z T=2048, 2025-05-07T20:32:25.3532121Z D=5120, 2025-05-07T20:32:25.3532404Z scale_ub=1200.0, 2025-05-07T20:32:25.3532738Z contiguous=True, 2025-05-07T20:32:25.3533040Z compiled=False, 2025-05-07T20:32:25.3533391Z ) 2025-05-07T20:32:25.3533814Z self = 2025-05-07T20:32:25.3534466Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.3534826Z 2025-05-07T20:32:25.3534946Z @given( 2025-05-07T20:32:25.3535244Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3535663Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.3536128Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.3536562Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.3537008Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.3537404Z ) 2025-05-07T20:32:25.3537876Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.3538494Z def test_silu_mul_quant( 2025-05-07T20:32:25.3538839Z self, 2025-05-07T20:32:25.3539109Z T: int, 2025-05-07T20:32:25.3539396Z D: int, 2025-05-07T20:32:25.3539702Z scale_ub: Optional[float], 2025-05-07T20:32:25.3540072Z contiguous: bool, 2025-05-07T20:32:25.3540397Z compiled: bool, 2025-05-07T20:32:25.3540715Z ) -> None: 2025-05-07T20:32:25.3541018Z torch.manual_seed(2025) 2025-05-07T20:32:25.3541338Z 2025-05-07T20:32:25.3541718Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3542268Z 2025-05-07T20:32:25.3542536Z x_sign = torch.sign(x) 2025-05-07T20:32:25.3542945Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.3543379Z x = x_sign * x_clamp 2025-05-07T20:32:25.3543714Z x0 = x[:, :D] 2025-05-07T20:32:25.3544023Z x1 = x[:, D:] 2025-05-07T20:32:25.3544322Z 2025-05-07T20:32:25.3544582Z if contiguous: 2025-05-07T20:32:25.3544911Z x0 = x0.contiguous() 2025-05-07T20:32:25.3545279Z x1 = x1.contiguous() 2025-05-07T20:32:25.3545615Z 2025-05-07T20:32:25.3545891Z if scale_ub is not None: 2025-05-07T20:32:25.3546277Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.3546729Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.3547173Z ) 2025-05-07T20:32:25.3547449Z else: 2025-05-07T20:32:25.3547743Z scale_ub_tensor = None 2025-05-07T20:32:25.3548087Z 2025-05-07T20:32:25.3548399Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3548823Z op = silu_mul_quant 2025-05-07T20:32:25.3549983Z if compiled: 2025-05-07T20:32:25.3550332Z op = torch.compile(op) 2025-05-07T20:32:25.3550734Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3551109Z 2025-05-07T20:32:25.3551383Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.3551595Z 2025-05-07T20:32:25.3551737Z moe/activation_test.py:117: 2025-05-07T20:32:25.3552202Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3552706Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.3553090Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3554000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.3554910Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.3555615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.3556566Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.3557472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.3558194Z kernel = self.compile( 2025-05-07T20:32:25.3558929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.3559885Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.3560415Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3560731Z 2025-05-07T20:32:25.3561005Z self = 2025-05-07T20:32:25.3562471Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.3564362Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05c45102c0>} 2025-05-07T20:32:25.3566178Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.3567688Z context = 2025-05-07T20:32:25.3568092Z 2025-05-07T20:32:25.3568315Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.3568907Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.3569373Z module_map=module_map) 2025-05-07T20:32:25.3569735Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.3570158Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.3570424Z E ^ 2025-05-07T20:32:25.3570886Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.3571340Z 2025-05-07T20:32:25.3571755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.3572286Z 2025-05-07T20:32:25.3572406Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.3572844Z self=, 2025-05-07T20:32:25.3573244Z T=2048, 2025-05-07T20:32:25.3573441Z D=5120, 2025-05-07T20:32:25.3573642Z scale_ub=1200.0, 2025-05-07T20:32:25.3573865Z contiguous=True, 2025-05-07T20:32:25.3574091Z compiled=True, 2025-05-07T20:32:25.3574304Z ) 2025-05-07T20:32:25.3574622Z self = 2025-05-07T20:32:25.3575122Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.3575391Z 2025-05-07T20:32:25.3575480Z @given( 2025-05-07T20:32:25.3575712Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3576029Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.3576347Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.3576682Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.3577063Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.3577358Z ) 2025-05-07T20:32:25.3577602Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.3577709Z def test_silu_mul_quant( 2025-05-07T20:32:25.3577791Z self, 2025-05-07T20:32:25.3577871Z T: int, 2025-05-07T20:32:25.3577957Z D: int, 2025-05-07T20:32:25.3578057Z scale_ub: Optional[float], 2025-05-07T20:32:25.3578154Z contiguous: bool, 2025-05-07T20:32:25.3578253Z compiled: bool, 2025-05-07T20:32:25.3578336Z ) -> None: 2025-05-07T20:32:25.3578440Z torch.manual_seed(2025) 2025-05-07T20:32:25.3578516Z 2025-05-07T20:32:25.3578688Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3578772Z 2025-05-07T20:32:25.3578867Z x_sign = torch.sign(x) 2025-05-07T20:32:25.3578994Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.3579095Z x = x_sign * x_clamp 2025-05-07T20:32:25.3579221Z x0 = x[:, :D] 2025-05-07T20:32:25.3579318Z x1 = x[:, D:] 2025-05-07T20:32:25.3579395Z 2025-05-07T20:32:25.3579482Z if contiguous: 2025-05-07T20:32:25.3579582Z x0 = x0.contiguous() 2025-05-07T20:32:25.3579676Z x1 = x1.contiguous() 2025-05-07T20:32:25.3579752Z 2025-05-07T20:32:25.3579855Z if scale_ub is not None: 2025-05-07T20:32:25.3579966Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.3580144Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.3580231Z ) 2025-05-07T20:32:25.3580312Z else: 2025-05-07T20:32:25.3580416Z scale_ub_tensor = None 2025-05-07T20:32:25.3580494Z 2025-05-07T20:32:25.3580626Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3580725Z op = silu_mul_quant 2025-05-07T20:32:25.3580813Z if compiled: 2025-05-07T20:32:25.3580919Z op = torch.compile(op) 2025-05-07T20:32:25.3581037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3581115Z 2025-05-07T20:32:25.3581208Z y_fp8, y_scale = fn() 2025-05-07T20:32:25.3581336Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:25.3581411Z 2025-05-07T20:32:25.3581547Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3581657Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:25.3581805Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:25.3581934Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:25.3582079Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.3582156Z 2025-05-07T20:32:25.3582267Z > 
>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f05c4510900>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f05bf51bc40>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
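Both failure paths bottom out in the same Triton check: fp8e4nv (the FP8 E4M3 encoding, torch.float8_e4m3fn) is only accepted by Triton's NVIDIA backend on GPUs of compute capability 8.9 or newer; older parts are limited to fp8e4b15 and fp8e5. A minimal sketch of a capability guard that would skip such draws up front instead of erroring inside the kernel compile; the helper name and the placeholder test are illustrative, not taken from the test file:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's NVIDIA backend accepts fp8e4nv only on SM 8.9+ (Ada/Hopper).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @pytest.mark.skipif(
        not supports_fp8e4nv(),
        reason="GPU does not support fp8e4nv (float8_e4m3fn) in Triton",
    )
    def test_fp8_kernels_compile() -> None:
        # Placeholder body; in practice the marker would sit on test_silu_mul_quant.
        assert supports_fp8e4nv()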
Hypothesis continues drawing examples, and every one fails with the same root error; only the sampled parameters and the first kernel to reach the compiler differ. With compiled=False the failure is raised from fn() at moe/activation_test.py:117 while compiling _fbgemm_silu_mul_quant; with compiled=True, fn() completes and the failure is instead raised from ref_fn() at moe/activation_test.py:126 while compiling _kernel_quantize_fp8_row (via triton_quantize_fp8_row, fp8_gemm.py:2370). The repeated error is:

E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> ref_fn(): _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> fn(): _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fn(): _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> ref_fn(): _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fn(): _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> fn(): _fbgemm_silu_mul_quant
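The tracebacks above pin down everything needed to reproduce one failure without Hypothesis. A minimal sketch of a standalone reproducer for the first failing example; the import path is inferred from the traceback's file path and may differ, and the shapes and scale_ub match the T=16384 draw:

    import torch
    # Inferred from .../fbgemm_gpu/experimental/gen_ai/moe/activation.py in the traceback:
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 16384, 7168
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D], x[:, D:]
    scale_ub = torch.tensor([1200.0], device="cuda", dtype=torch.float32)

    # On a GPU without fp8e4nv support this raises
    # triton.compiler.errors.CompilationError wrapping the ValueError above.
    y_fp8, y_scale = silu_mul_quant(x0, x1, scale_ub)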
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.3695880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.3696112Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.3696451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.3696556Z kernel = self.compile( 2025-05-07T20:32:25.3696935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.3697153Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.3697291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3697296Z 2025-05-07T20:32:25.3697499Z self = 2025-05-07T20:32:25.3698277Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.3698819Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05be791f80>} 2025-05-07T20:32:25.3699565Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.3699762Z context = 2025-05-07T20:32:25.3699767Z 2025-05-07T20:32:25.3699931Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.3700200Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.3700308Z module_map=module_map) 2025-05-07T20:32:25.3700511Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.3700616Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.3700698Z E ^ 2025-05-07T20:32:25.3701051Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.3701061Z 2025-05-07T20:32:25.3701473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.3701478Z 2025-05-07T20:32:25.3701589Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.3701815Z self=, 2025-05-07T20:32:25.3701895Z T=1, 2025-05-07T20:32:25.3701977Z D=5120, 2025-05-07T20:32:25.3702067Z scale_ub=None, 2025-05-07T20:32:25.3702155Z contiguous=True, 2025-05-07T20:32:25.3702239Z compiled=True, 2025-05-07T20:32:25.3702322Z ) 2025-05-07T20:32:25.3702540Z self = 2025-05-07T20:32:25.3702713Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.3702717Z 2025-05-07T20:32:25.3702795Z @given( 2025-05-07T20:32:25.3702915Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3703020Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.3703137Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.3703253Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.3703421Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.3703500Z ) 2025-05-07T20:32:25.3703747Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.3703849Z def test_silu_mul_quant( 2025-05-07T20:32:25.3703929Z self, 2025-05-07T20:32:25.3704014Z T: int, 2025-05-07T20:32:25.3704093Z D: int, 2025-05-07T20:32:25.3704192Z scale_ub: Optional[float], 2025-05-07T20:32:25.3704299Z contiguous: bool, 2025-05-07T20:32:25.3704386Z compiled: bool, 2025-05-07T20:32:25.3704467Z ) -> None: 2025-05-07T20:32:25.3704570Z torch.manual_seed(2025) 2025-05-07T20:32:25.3704645Z 2025-05-07T20:32:25.3704814Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3704899Z 2025-05-07T20:32:25.3704994Z x_sign = torch.sign(x) 2025-05-07T20:32:25.3705122Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.3705224Z x = x_sign * x_clamp 2025-05-07T20:32:25.3705351Z x0 = x[:, :D] 2025-05-07T20:32:25.3705445Z x1 = x[:, D:] 2025-05-07T20:32:25.3705521Z 2025-05-07T20:32:25.3705808Z if contiguous: 2025-05-07T20:32:25.3705957Z x0 = x0.contiguous() 2025-05-07T20:32:25.3706087Z x1 = x1.contiguous() 2025-05-07T20:32:25.3706182Z 2025-05-07T20:32:25.3706283Z if scale_ub is not None: 2025-05-07T20:32:25.3706503Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.3706645Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.3706729Z ) 2025-05-07T20:32:25.3706808Z else: 2025-05-07T20:32:25.3706904Z scale_ub_tensor = None 2025-05-07T20:32:25.3706986Z 2025-05-07T20:32:25.3707117Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3707216Z op = silu_mul_quant 2025-05-07T20:32:25.3707304Z if compiled: 2025-05-07T20:32:25.3707411Z op = torch.compile(op) 2025-05-07T20:32:25.3707526Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3707602Z 2025-05-07T20:32:25.3707697Z y_fp8, y_scale = fn() 2025-05-07T20:32:25.3707825Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:25.3707902Z 2025-05-07T20:32:25.3708040Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3708149Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:25.3708326Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:25.3708449Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:25.3708597Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.3708673Z 2025-05-07T20:32:25.3708779Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:25.3708784Z 2025-05-07T20:32:25.3708885Z moe/activation_test.py:126: 2025-05-07T20:32:25.3709021Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3709139Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:25.3709272Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.3709835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:25.3709941Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:25.3710300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.3710536Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.3710899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:25.3711161Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.3711622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:25.3711877Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.3712257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:25.3712426Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:25.3712772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:25.3712857Z fn() 2025-05-07T20:32:25.3713255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:25.3713343Z self.fn.run( 2025-05-07T20:32:25.3713682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.3713778Z kernel = self.compile( 2025-05-07T20:32:25.3714225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.3714402Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.3714538Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3714543Z 2025-05-07T20:32:25.3714747Z self = 2025-05-07T20:32:25.3715558Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.3716064Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f05be792fc0>} 2025-05-07T20:32:25.3716812Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.3717010Z context = 2025-05-07T20:32:25.3717015Z 2025-05-07T20:32:25.3717183Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.3717449Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.3717599Z module_map=module_map) 2025-05-07T20:32:25.3717761Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.3717869Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:25.3717950Z E ^ 2025-05-07T20:32:25.3718302Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.3718307Z 2025-05-07T20:32:25.3718728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.3718733Z 2025-05-07T20:32:25.3718839Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.3719067Z self=, 2025-05-07T20:32:25.3719149Z T=2048, 2025-05-07T20:32:25.3719230Z D=5120, 2025-05-07T20:32:25.3719321Z scale_ub=None, 2025-05-07T20:32:25.3719407Z contiguous=True, 2025-05-07T20:32:25.3719497Z compiled=True, 2025-05-07T20:32:25.3719581Z ) 2025-05-07T20:32:25.3719798Z self = 2025-05-07T20:32:25.3719968Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.3719979Z 2025-05-07T20:32:25.3720062Z @given( 2025-05-07T20:32:25.3720184Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3720294Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.3720453Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.3720577Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.3720697Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.3720775Z ) 2025-05-07T20:32:25.3721020Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.3721123Z def test_silu_mul_quant( 2025-05-07T20:32:25.3721203Z self, 2025-05-07T20:32:25.3721283Z T: int, 2025-05-07T20:32:25.3721374Z D: int, 2025-05-07T20:32:25.3721474Z scale_ub: Optional[float], 2025-05-07T20:32:25.3721572Z contiguous: bool, 2025-05-07T20:32:25.3721659Z compiled: bool, 2025-05-07T20:32:25.3721738Z ) -> None: 2025-05-07T20:32:25.3721840Z torch.manual_seed(2025) 2025-05-07T20:32:25.3721921Z 2025-05-07T20:32:25.3722091Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3722176Z 2025-05-07T20:32:25.3722272Z x_sign = torch.sign(x) 2025-05-07T20:32:25.3722445Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.3722543Z x = x_sign * x_clamp 2025-05-07T20:32:25.3722628Z x0 = x[:, :D] 2025-05-07T20:32:25.3722712Z x1 = x[:, D:] 2025-05-07T20:32:25.3722795Z 2025-05-07T20:32:25.3722882Z if contiguous: 2025-05-07T20:32:25.3722982Z x0 = x0.contiguous() 2025-05-07T20:32:25.3723075Z x1 = x1.contiguous() 2025-05-07T20:32:25.3723218Z 2025-05-07T20:32:25.3723316Z if scale_ub is not None: 2025-05-07T20:32:25.3723428Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.3723565Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.3723650Z ) 2025-05-07T20:32:25.3723730Z else: 2025-05-07T20:32:25.3723828Z scale_ub_tensor = None 2025-05-07T20:32:25.3723915Z 2025-05-07T20:32:25.3724044Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3724137Z op = silu_mul_quant 2025-05-07T20:32:25.3724237Z if compiled: 
2025-05-07T20:32:25.3724339Z op = torch.compile(op) 2025-05-07T20:32:25.3724457Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3724534Z 2025-05-07T20:32:25.3724626Z y_fp8, y_scale = fn() 2025-05-07T20:32:25.3724753Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:25.3724830Z 2025-05-07T20:32:25.3724966Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3725118Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:25.3725224Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:25.3725347Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:25.3725494Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.3725572Z 2025-05-07T20:32:25.3725673Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:25.3725683Z 2025-05-07T20:32:25.3725786Z moe/activation_test.py:126: 2025-05-07T20:32:25.3725919Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3726037Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:25.3726172Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.3726728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:25.3726843Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:25.3727203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.3727432Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.3727880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:25.3728182Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.3728587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:25.3728839Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.3729217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:25.3729392Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:25.3729737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:25.3729824Z fn() 2025-05-07T20:32:25.3730222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:25.3730308Z self.fn.run( 2025-05-07T20:32:25.3730659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.3730794Z kernel = self.compile( 2025-05-07T20:32:25.3731175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.3731354Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.3731485Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3731490Z 2025-05-07T20:32:25.3731741Z self = 2025-05-07T20:32:25.3732514Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:25.3733020Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05be712ac0>} 2025-05-07T20:32:25.3733766Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.3733957Z context = 2025-05-07T20:32:25.3733962Z 2025-05-07T20:32:25.3734133Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.3734439Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.3734556Z module_map=module_map) 2025-05-07T20:32:25.3734719Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.3734822Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:25.3734907Z E ^ 2025-05-07T20:32:25.3735263Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.3735270Z 2025-05-07T20:32:25.3735684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.3735694Z 2025-05-07T20:32:25.3735802Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.3736024Z self=, 2025-05-07T20:32:25.3736113Z T=128, 2025-05-07T20:32:25.3736192Z D=5120, 2025-05-07T20:32:25.3736281Z scale_ub=None, 2025-05-07T20:32:25.3736374Z contiguous=True, 2025-05-07T20:32:25.3736459Z compiled=True, 2025-05-07T20:32:25.3736540Z ) 2025-05-07T20:32:25.3736767Z self = 2025-05-07T20:32:25.3736935Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.3736940Z 2025-05-07T20:32:25.3737034Z @given( 2025-05-07T20:32:25.3737155Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3737301Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.3737427Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.3737546Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.3737661Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.3737744Z ) 2025-05-07T20:32:25.3737990Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.3738091Z def test_silu_mul_quant( 2025-05-07T20:32:25.3738181Z self, 2025-05-07T20:32:25.3738261Z T: int, 2025-05-07T20:32:25.3738344Z D: int, 2025-05-07T20:32:25.3738455Z scale_ub: Optional[float], 2025-05-07T20:32:25.3738548Z contiguous: bool, 2025-05-07T20:32:25.3738640Z compiled: bool, 2025-05-07T20:32:25.3738721Z ) -> None: 2025-05-07T20:32:25.3738819Z torch.manual_seed(2025) 2025-05-07T20:32:25.3738907Z 2025-05-07T20:32:25.3739081Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3739199Z 2025-05-07T20:32:25.3739301Z x_sign = torch.sign(x) 2025-05-07T20:32:25.3739430Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.3739523Z x = x_sign * x_clamp 2025-05-07T20:32:25.3739613Z x0 = x[:, :D] 2025-05-07T20:32:25.3739697Z x1 = x[:, D:] 2025-05-07T20:32:25.3739770Z 2025-05-07T20:32:25.3739864Z if contiguous: 2025-05-07T20:32:25.3740001Z x0 = x0.contiguous() 2025-05-07T20:32:25.3740096Z x1 = x1.contiguous() 2025-05-07T20:32:25.3740171Z 2025-05-07T20:32:25.3740263Z if scale_ub is not None: 2025-05-07T20:32:25.3740373Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.3740510Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.3740598Z ) 2025-05-07T20:32:25.3740678Z else: 2025-05-07T20:32:25.3740775Z scale_ub_tensor = None 2025-05-07T20:32:25.3740860Z 2025-05-07T20:32:25.3740996Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:32:25.3741090Z op = silu_mul_quant 2025-05-07T20:32:25.3741190Z if compiled: 2025-05-07T20:32:25.3741296Z op = torch.compile(op) 2025-05-07T20:32:25.3741403Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3741491Z 2025-05-07T20:32:25.3741587Z y_fp8, y_scale = fn() 2025-05-07T20:32:25.3741711Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:25.3741841Z 2025-05-07T20:32:25.3741980Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3742092Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:25.3742198Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:25.3757990Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:25.3758174Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.3758249Z 2025-05-07T20:32:25.3758363Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:25.3758369Z 2025-05-07T20:32:25.3758471Z moe/activation_test.py:126: 2025-05-07T20:32:25.3758604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3758715Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:25.3758853Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.3759424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:25.3759537Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:25.3759901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.3760127Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.3760567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:25.3760828Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.3761229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:25.3761483Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.3761866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:25.3762040Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:25.3762385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:25.3762474Z fn() 2025-05-07T20:32:25.3762926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:25.3763013Z self.fn.run( 2025-05-07T20:32:25.3763408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.3763512Z kernel = self.compile( 2025-05-07T20:32:25.3763902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.3764080Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.3764214Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3764259Z 2025-05-07T20:32:25.3764477Z self = 2025-05-07T20:32:25.3765258Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
2025-05-07T20:32:25.3768584Z 
2025-05-07T20:32:25.3768691Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:25.3775621Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:25.3775728Z moe/activation_test.py:126: 
2025-05-07T20:32:25.3784610Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.3784715Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:25.3784803Z E       ^
2025-05-07T20:32:25.3785157Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.3785579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.3785695Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:25.3792562Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:25.3792697Z moe/activation_test.py:126: 
2025-05-07T20:32:25.3801547Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.3801650Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:25.3801732Z E       ^
2025-05-07T20:32:25.3802092Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.3802510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.3802657Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:25.3808860Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.3808961Z moe/activation_test.py:117: 
2025-05-07T20:32:25.3809100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:25.3809204Z moe/activation_test.py:115: in fn
2025-05-07T20:32:25.3809309Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:25.3809680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:25.3809773Z     return fn(*args, **kwargs)
2025-05-07T20:32:25.3810265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:25.3810375Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:25.3810732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:25.3810958Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:25.3811298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:25.3811396Z     kernel = self.compile(
2025-05-07T20:32:25.3811853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:25.3812030Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:25.3812163Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:25.3812167Z 
2025-05-07T20:32:25.3812371Z self = <...>
2025-05-07T20:32:25.3813145Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:25.3813657Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f0509a30a40>}
2025-05-07T20:32:25.3814458Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:25.3814652Z context = <...>
2025-05-07T20:32:25.3814657Z 
2025-05-07T20:32:25.3814820Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:25.3815082Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:25.3815236Z                            module_map=module_map)
2025-05-07T20:32:25.3815400Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.3815506Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.3815586Z E       ^
2025-05-07T20:32:25.3815939Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.3816366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.3816477Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:25.3823339Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:25.3823449Z moe/activation_test.py:126: 
2025-05-07T20:32:25.3832317Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.3832421Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:25.3832508Z E       ^
2025-05-07T20:32:25.3832861Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.3833285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
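The reference path that fails above, ref_fn, reduces to SiLU-times-gate in fp32 followed by rowwise fp8 quantization. A minimal PyTorch sketch of that rowwise scheme, as a stand-in for triton_quantize_fp8_row rather than FBGEMM's actual kernel; the 448.0 maximum for float8_e4m3fn and the treatment of scale_ub as a cap on the per-row max are assumptions consistent with the test's dequantization check (y_fp8.to(torch.float32) * y_scale[:, None]):

import torch

def quantize_fp8_row_sketch(y: torch.Tensor, scale_ub: torch.Tensor | None = None,
                            eps: float = 1e-12):
    # One dequantization scale per row, chosen so that
    # y ≈ y_fp8.to(torch.float32) * scale[:, None], as the test asserts.
    fp8_max = 448.0  # largest finite float8_e4m3fn value
    row_max = y.abs().amax(dim=1).clamp(min=eps)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max / fp8_max
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale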
2025-05-07T20:32:25.3833400Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:25.3839382Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.3839487Z moe/activation_test.py:117: 
2025-05-07T20:32:25.3845484Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.3845592Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.3845672Z E       ^
2025-05-07T20:32:25.3846027Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.3846449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.3846565Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:25.3852292Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.3852400Z moe/activation_test.py:117: 
2025-05-07T20:32:25.3858816Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.3858914Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.3859000Z E       ^
2025-05-07T20:32:25.3859352Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.3859771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.3859889Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:25.3865526Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.3865628Z moe/activation_test.py:117: 
2025-05-07T20:32:25.3871489Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.3871598Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.3871717Z E       ^
2025-05-07T20:32:25.3872074Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.3872489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
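Every remaining example raises the identical CompilationError, so Hypothesis keeps exercising a configuration this hardware cannot compile. One possible mitigation, sketched as a hypothetical guard that does not exist in moe/activation_test.py, would be to skip fp8 tests on GPUs without fp8e4nv support:

import unittest

import torch

def has_fp8e4nv() -> bool:
    # Hypothetical helper: Triton lowers fp8e4nv only on sm_89 and newer GPUs.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Applied to the failing test, e.g.:
# @unittest.skipUnless(has_fp8e4nv(), "fp8e4nv requires an sm_89+ GPU")
# def test_silu_mul_quant(self, ...) -> None: ...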
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.3866965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.3867196Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.3867536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.3867635Z kernel = self.compile( 2025-05-07T20:32:25.3868015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.3868187Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.3868325Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3868329Z 2025-05-07T20:32:25.3868530Z self = 2025-05-07T20:32:25.3869305Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.3869845Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05091fca40>} 2025-05-07T20:32:25.3870593Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.3870784Z context = 2025-05-07T20:32:25.3870792Z 2025-05-07T20:32:25.3870956Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.3871222Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.3871329Z module_map=module_map) 2025-05-07T20:32:25.3871489Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.3871598Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.3871717Z E ^ 2025-05-07T20:32:25.3872074Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.3872079Z 2025-05-07T20:32:25.3872489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.3872494Z 2025-05-07T20:32:25.3872597Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.3872862Z self=, 2025-05-07T20:32:25.3872941Z T=128, 2025-05-07T20:32:25.3873021Z D=5120, 2025-05-07T20:32:25.3873110Z scale_ub=None, 2025-05-07T20:32:25.3873197Z contiguous=False, 2025-05-07T20:32:25.3873285Z compiled=False, 2025-05-07T20:32:25.3873361Z ) 2025-05-07T20:32:25.3873576Z self = 2025-05-07T20:32:25.3873754Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:25.3873761Z 2025-05-07T20:32:25.3873842Z @given( 2025-05-07T20:32:25.3873963Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3874072Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.3874188Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.3874303Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.3874422Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.3874542Z ) 2025-05-07T20:32:25.3874788Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.3874883Z def test_silu_mul_quant( 2025-05-07T20:32:25.3874963Z self, 2025-05-07T20:32:25.3875046Z T: int, 2025-05-07T20:32:25.3875123Z D: int, 2025-05-07T20:32:25.3875222Z scale_ub: Optional[float], 2025-05-07T20:32:25.3875317Z contiguous: bool, 2025-05-07T20:32:25.3875405Z compiled: bool, 2025-05-07T20:32:25.3875482Z ) -> None: 2025-05-07T20:32:25.3875584Z torch.manual_seed(2025) 2025-05-07T20:32:25.3875659Z 2025-05-07T20:32:25.3875827Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3875910Z 2025-05-07T20:32:25.3875999Z x_sign = torch.sign(x) 2025-05-07T20:32:25.3876127Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.3876216Z x = x_sign * x_clamp 2025-05-07T20:32:25.3876302Z x0 = x[:, :D] 2025-05-07T20:32:25.3876391Z x1 = x[:, D:] 2025-05-07T20:32:25.3876466Z 2025-05-07T20:32:25.3876549Z if contiguous: 2025-05-07T20:32:25.3876644Z x0 = x0.contiguous() 2025-05-07T20:32:25.3876735Z x1 = x1.contiguous() 2025-05-07T20:32:25.3876807Z 2025-05-07T20:32:25.3876903Z if scale_ub is not None: 2025-05-07T20:32:25.3877010Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.3877145Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.3877271Z ) 2025-05-07T20:32:25.3877351Z else: 2025-05-07T20:32:25.3877451Z scale_ub_tensor = None 2025-05-07T20:32:25.3877526Z 2025-05-07T20:32:25.3877653Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3877747Z op = silu_mul_quant 2025-05-07T20:32:25.3877838Z if compiled: 2025-05-07T20:32:25.3877942Z op = torch.compile(op) 2025-05-07T20:32:25.3878057Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3878130Z 2025-05-07T20:32:25.3878222Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.3878226Z 2025-05-07T20:32:25.3878328Z moe/activation_test.py:117: 2025-05-07T20:32:25.3878459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3878563Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.3878661Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3879200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.3879306Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.3879663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.3879882Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.3880228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.3880363Z kernel = self.compile( 2025-05-07T20:32:25.3880746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.3880919Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.3881047Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3881051Z 2025-05-07T20:32:25.3881265Z self = 2025-05-07T20:32:25.3882034Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.3882559Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0508c84720>} 2025-05-07T20:32:25.3883394Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.3883581Z context = 2025-05-07T20:32:25.3883586Z 2025-05-07T20:32:25.3883754Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.3884020Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.3884134Z module_map=module_map) 2025-05-07T20:32:25.3884298Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.3884398Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.3884484Z E ^ 2025-05-07T20:32:25.3884836Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.3884846Z 2025-05-07T20:32:25.3885261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.3885265Z 2025-05-07T20:32:25.3885370Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.3885591Z self=, 2025-05-07T20:32:25.3885680Z T=128, 2025-05-07T20:32:25.3890263Z D=5120, 2025-05-07T20:32:25.3890445Z scale_ub=1200.0, 2025-05-07T20:32:25.3890540Z contiguous=True, 2025-05-07T20:32:25.3890625Z compiled=False, 2025-05-07T20:32:25.3890704Z ) 2025-05-07T20:32:25.3890930Z self = 2025-05-07T20:32:25.3891104Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.3891109Z 2025-05-07T20:32:25.3891191Z @given( 2025-05-07T20:32:25.3891313Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3891417Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.3891537Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.3891654Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.3891777Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.3891853Z ) 2025-05-07T20:32:25.3892100Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.3892205Z def test_silu_mul_quant( 2025-05-07T20:32:25.3892327Z self, 2025-05-07T20:32:25.3892408Z T: int, 2025-05-07T20:32:25.3892494Z D: int, 2025-05-07T20:32:25.3892596Z scale_ub: Optional[float], 2025-05-07T20:32:25.3892688Z contiguous: bool, 2025-05-07T20:32:25.3892786Z compiled: bool, 2025-05-07T20:32:25.3892874Z ) -> None: 2025-05-07T20:32:25.3892974Z torch.manual_seed(2025) 2025-05-07T20:32:25.3893053Z 2025-05-07T20:32:25.3893302Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3893382Z 2025-05-07T20:32:25.3893481Z x_sign = torch.sign(x) 2025-05-07T20:32:25.3893608Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.3893711Z x = x_sign * x_clamp 2025-05-07T20:32:25.3893794Z x0 = x[:, :D] 2025-05-07T20:32:25.3893878Z x1 = x[:, D:] 2025-05-07T20:32:25.3893958Z 2025-05-07T20:32:25.3894048Z if contiguous: 2025-05-07T20:32:25.3894145Z x0 = x0.contiguous() 2025-05-07T20:32:25.3894245Z x1 = x1.contiguous() 2025-05-07T20:32:25.3894320Z 2025-05-07T20:32:25.3894413Z if scale_ub is not None: 2025-05-07T20:32:25.3894525Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.3894662Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.3894740Z ) 2025-05-07T20:32:25.3894821Z else: 2025-05-07T20:32:25.3894920Z scale_ub_tensor = None 2025-05-07T20:32:25.3895053Z 2025-05-07T20:32:25.3895185Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3895278Z op = silu_mul_quant 2025-05-07T20:32:25.3895367Z if compiled: 2025-05-07T20:32:25.3895473Z op = torch.compile(op) 2025-05-07T20:32:25.3895582Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3895666Z 2025-05-07T20:32:25.3895764Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.3895769Z 2025-05-07T20:32:25.3895872Z moe/activation_test.py:117: 2025-05-07T20:32:25.3896015Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3896118Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.3896224Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3896730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.3896834Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:25.3897204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:25.3897427Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:25.3897769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:25.3897870Z     kernel = self.compile(
2025-05-07T20:32:25.3898301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:25.3898478Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:25.3898607Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:25.3898612Z 
2025-05-07T20:32:25.3898820Z self = 
2025-05-07T20:32:25.3899596Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:25.3900108Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0508c858a0>}
2025-05-07T20:32:25.3900896Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:25.3901090Z context = 
2025-05-07T20:32:25.3901095Z 
2025-05-07T20:32:25.3901264Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:25.3901529Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:25.3901683Z                            module_map=module_map)
2025-05-07T20:32:25.3901848Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.3901950Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.3902037Z E       ^
2025-05-07T20:32:25.3902395Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.3902401Z 
2025-05-07T20:32:25.3902863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.3902873Z 
2025-05-07T20:32:25.3902980Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.3903202Z     self=,
2025-05-07T20:32:25.3903287Z     T=1,
2025-05-07T20:32:25.3903368Z     D=7168,
2025-05-07T20:32:25.3903456Z     scale_ub=1200.0,
2025-05-07T20:32:25.3903547Z     contiguous=True,
2025-05-07T20:32:25.3903634Z     compiled=True,
2025-05-07T20:32:25.3903755Z )
2025-05-07T20:32:25.3903978Z self = 
2025-05-07T20:32:25.3904144Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:25.3904149Z 
2025-05-07T20:32:25.3904234Z     @given(
2025-05-07T20:32:25.3904355Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:25.3904458Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:25.3904579Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:25.3904704Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:25.3904820Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:25.3904898Z     )
2025-05-07T20:32:25.3905146Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:25.3905247Z     def test_silu_mul_quant(
2025-05-07T20:32:25.3905331Z         self,
2025-05-07T20:32:25.3905411Z         T: int,
2025-05-07T20:32:25.3905493Z         D: int,
2025-05-07T20:32:25.3905815Z         scale_ub: Optional[float],
2025-05-07T20:32:25.3905954Z         contiguous: bool,
2025-05-07T20:32:25.3906088Z         compiled: bool,
2025-05-07T20:32:25.3906185Z     ) -> None:
2025-05-07T20:32:25.3906284Z         torch.manual_seed(2025)
2025-05-07T20:32:25.3906366Z 
2025-05-07T20:32:25.3906539Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:25.3906614Z 
2025-05-07T20:32:25.3906710Z         x_sign = torch.sign(x)
2025-05-07T20:32:25.3906931Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:25.3907023Z         x = x_sign * x_clamp
2025-05-07T20:32:25.3907112Z         x0 = x[:, :D]
2025-05-07T20:32:25.3907197Z         x1 = x[:, D:]
2025-05-07T20:32:25.3907275Z 
2025-05-07T20:32:25.3907366Z         if contiguous:
2025-05-07T20:32:25.3907460Z             x0 = x0.contiguous()
2025-05-07T20:32:25.3907556Z             x1 = x1.contiguous()
2025-05-07T20:32:25.3907635Z 
2025-05-07T20:32:25.3907733Z         if scale_ub is not None:
2025-05-07T20:32:25.3907852Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:25.3907992Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:25.3908069Z             )
2025-05-07T20:32:25.3908149Z         else:
2025-05-07T20:32:25.3908247Z             scale_ub_tensor = None
2025-05-07T20:32:25.3908323Z 
2025-05-07T20:32:25.3908467Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:25.3908559Z             op = silu_mul_quant
2025-05-07T20:32:25.3908713Z             if compiled:
2025-05-07T20:32:25.3908822Z                 op = torch.compile(op)
2025-05-07T20:32:25.3908930Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:25.3909011Z 
2025-05-07T20:32:25.3909103Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.3909108Z 
2025-05-07T20:32:25.3909206Z moe/activation_test.py:117: 
2025-05-07T20:32:25.3909340Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:25.3909505Z moe/activation_test.py:115: in fn
2025-05-07T20:32:25.3909607Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:25.3909978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:25.3910072Z     return fn(*args, **kwargs)
2025-05-07T20:32:25.3910567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:25.3910675Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:25.3911038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:25.3911262Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:25.3911601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:25.3911699Z     kernel = self.compile(
2025-05-07T20:32:25.3912149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:25.3912326Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:25.3912458Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:25.3912463Z 
2025-05-07T20:32:25.3912666Z self = 
2025-05-07T20:32:25.3913444Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:25.3913948Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0508c86e80>}
2025-05-07T20:32:25.3914695Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:25.3914894Z context = 
2025-05-07T20:32:25.3914898Z 
2025-05-07T20:32:25.3915062Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:25.3915327Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:25.3915486Z                            module_map=module_map)
2025-05-07T20:32:25.3915652Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.3915760Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.3915840Z E       ^
2025-05-07T20:32:25.3915840Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.3916201Z 
2025-05-07T20:32:25.3916616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.3916625Z 
2025-05-07T20:32:25.3916730Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[... test source listing and fp8e4nv CompilationError traceback identical to the example above ...]
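Every failure above has the same root cause: fp8e4nv is Triton's FP8 E4M3 encoding, and this Triton build only lowers it on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper), while the g5 runner's A10G is SM 8.6, so the ValueError fires before any kernel runs. Below is a minimal sketch of a capability guard that could skip these tests on such hardware; the helper and the skipIf wiring are illustrative, not code from this repository.

```python
# Sketch only: gate FP8 E4M3 (fp8e4nv) tests on device capability.
# The 8.9 cutoff matches the Triton error above; class/function names are illustrative.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv lowering requires SM 8.9+ (e.g. L4, L40, H100); the A10G is SM 8.6.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv needs compute capability >= 8.9")
class ActivationTests(unittest.TestCase):
    ...
```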
2025-05-07T20:32:25.3930407Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
[... identical test source elided; on this example fn() returns and the failure moves to the reference path ...]
2025-05-07T20:32:25.3936098Z         y_fp8, y_scale = fn()
2025-05-07T20:32:25.3936220Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:25.3936294Z 
2025-05-07T20:32:25.3936489Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:25.3936593Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:25.3936698Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:25.3936823Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:25.3936962Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:25.3937046Z 
2025-05-07T20:32:25.3937150Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:25.3937155Z 
2025-05-07T20:32:25.3937258Z moe/activation_test.py:126: 
2025-05-07T20:32:25.3937405Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:25.3937512Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:25.3937648Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:25.3938208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:25.3938315Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:25.3938679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:25.3938900Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:25.3939263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:25.3939567Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:25.3939964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:25.3940218Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:25.3940593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:25.3940763Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:25.3941107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:25.3941186Z     fn()
2025-05-07T20:32:25.3941585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:25.3941674Z     self.fn.run(
2025-05-07T20:32:25.3942055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:25.3942157Z     kernel = self.compile(
2025-05-07T20:32:25.3942539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:25.3942714Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:25.3942846Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:25.3942889Z 
2025-05-07T20:32:25.3943099Z self = 
2025-05-07T20:32:25.3943879Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:25.3944379Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0508e45580>}
2025-05-07T20:32:25.3945122Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:25.3945314Z context = 
2025-05-07T20:32:25.3945319Z 
2025-05-07T20:32:25.3945482Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:25.3945789Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:25.3945897Z                            module_map=module_map)
2025-05-07T20:32:25.3946060Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.3946172Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:25.3946250Z E       ^
2025-05-07T20:32:25.3946609Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.3946616Z 
2025-05-07T20:32:25.3947029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.3947034Z 
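The reference path above spells out the semantics under test: y = x0 * sigmoid(x0) * x1 (SiLU-and-multiply), followed by row-wise FP8 quantization with an optional scale upper bound. The sketch below shows that quantization in plain PyTorch, consistent with the test's dequant step y_fp8.to(torch.float32) * y_scale[:, None]; the helper name, the epsilon guard, and the choice of torch.float8_e4m3fn are assumptions here, and FBGEMM's triton_quantize_fp8_row remains the authoritative implementation.

```python
# Sketch of row-wise FP8 quantization in plain PyTorch (runs on any device).
# Not FBGEMM's implementation; the names and the eps guard are assumptions.
from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row absolute maximum, optionally clipped to the scale upper bound.
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    # Dequantization scale per row; eps guard avoids division by zero.
    scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
    # Scale into the representable FP8 range, clamp, then cast.
    y_scaled = torch.clamp(y.to(torch.float32) / scale[:, None], -FP8_MAX, FP8_MAX)
    return y_scaled.to(torch.float8_e4m3fn), scale


# Dequantization mirrors the test: y ~= y_fp8.to(torch.float32) * scale[:, None]
```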
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.3959953Z 2025-05-07T20:32:25.3960365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.3960370Z 2025-05-07T20:32:25.3960479Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.3960701Z self=, 2025-05-07T20:32:25.3960779Z T=1, 2025-05-07T20:32:25.3960866Z D=5120, 2025-05-07T20:32:25.3960955Z scale_ub=1200.0, 2025-05-07T20:32:25.3961044Z contiguous=False, 2025-05-07T20:32:25.3961135Z compiled=False, 2025-05-07T20:32:25.3961213Z ) 2025-05-07T20:32:25.3961438Z self = 2025-05-07T20:32:25.3961606Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:25.3961610Z 2025-05-07T20:32:25.3961689Z @given( 2025-05-07T20:32:25.3961858Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3961958Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.3962076Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.3962196Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.3962310Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.3962386Z ) 2025-05-07T20:32:25.3962658Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.3962767Z def test_silu_mul_quant( 2025-05-07T20:32:25.3962864Z self, 2025-05-07T20:32:25.3962941Z T: int, 2025-05-07T20:32:25.3963018Z D: int, 2025-05-07T20:32:25.3963122Z scale_ub: Optional[float], 2025-05-07T20:32:25.3963216Z contiguous: bool, 2025-05-07T20:32:25.3963304Z compiled: bool, 2025-05-07T20:32:25.3963390Z ) -> None: 2025-05-07T20:32:25.3963485Z torch.manual_seed(2025) 2025-05-07T20:32:25.3963566Z 2025-05-07T20:32:25.3963743Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3963819Z 2025-05-07T20:32:25.3963911Z x_sign = torch.sign(x) 2025-05-07T20:32:25.3964040Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.3964129Z x = x_sign * x_clamp 2025-05-07T20:32:25.3964214Z x0 = x[:, :D] 2025-05-07T20:32:25.3964296Z x1 = x[:, D:] 2025-05-07T20:32:25.3964371Z 2025-05-07T20:32:25.3964460Z if contiguous: 2025-05-07T20:32:25.3964602Z x0 = x0.contiguous() 2025-05-07T20:32:25.3964694Z x1 = x1.contiguous() 2025-05-07T20:32:25.3964773Z 2025-05-07T20:32:25.3964866Z if scale_ub is not None: 2025-05-07T20:32:25.3964974Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.3965114Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.3965190Z ) 2025-05-07T20:32:25.3965267Z else: 2025-05-07T20:32:25.3965372Z scale_ub_tensor = None 2025-05-07T20:32:25.3965448Z 2025-05-07T20:32:25.3965579Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3965675Z op = silu_mul_quant 2025-05-07T20:32:25.3965762Z if compiled: 2025-05-07T20:32:25.3965867Z op = torch.compile(op) 2025-05-07T20:32:25.3965975Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3966051Z 2025-05-07T20:32:25.3966149Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.3966157Z 2025-05-07T20:32:25.3966295Z moe/activation_test.py:117: 2025-05-07T20:32:25.3966428Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3966533Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.3966633Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3967133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.3967276Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.3967750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.3967978Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.3968317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.3968413Z kernel = self.compile( 2025-05-07T20:32:25.3968801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.3968975Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.3969105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3969110Z 2025-05-07T20:32:25.3969311Z self = 2025-05-07T20:32:25.3970088Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.3970638Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0508e472e0>} 2025-05-07T20:32:25.3971385Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.3971578Z context = 2025-05-07T20:32:25.3971583Z 2025-05-07T20:32:25.3971746Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.3972011Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.3972131Z module_map=module_map) 2025-05-07T20:32:25.3972293Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.3972396Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.3972477Z E ^ 2025-05-07T20:32:25.3972831Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.3972835Z 2025-05-07T20:32:25.3973294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.3973299Z 2025-05-07T20:32:25.3973405Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.3973628Z self=, 2025-05-07T20:32:25.3973708Z T=16384, 2025-05-07T20:32:25.3973786Z D=5120, 2025-05-07T20:32:25.3973875Z scale_ub=1200.0, 2025-05-07T20:32:25.3973963Z contiguous=False, 2025-05-07T20:32:25.3974054Z compiled=True, 2025-05-07T20:32:25.3974132Z ) 2025-05-07T20:32:25.3974350Z self = 2025-05-07T20:32:25.3974527Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:25.3974531Z 2025-05-07T20:32:25.3974612Z @given( 2025-05-07T20:32:25.3974733Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3974837Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.3974955Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.3975113Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.3975233Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.3975309Z ) 2025-05-07T20:32:25.3975555Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.3975652Z def test_silu_mul_quant( 2025-05-07T20:32:25.3975729Z self, 2025-05-07T20:32:25.3975847Z T: int, 2025-05-07T20:32:25.3975932Z D: int, 2025-05-07T20:32:25.3976031Z scale_ub: Optional[float], 2025-05-07T20:32:25.3976121Z contiguous: bool, 2025-05-07T20:32:25.3976211Z compiled: bool, 2025-05-07T20:32:25.3976290Z ) -> None: 2025-05-07T20:32:25.3976388Z torch.manual_seed(2025) 2025-05-07T20:32:25.3976462Z 2025-05-07T20:32:25.3976633Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3976712Z 2025-05-07T20:32:25.3976809Z x_sign = torch.sign(x) 2025-05-07T20:32:25.3976939Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.3977031Z x = x_sign * x_clamp 2025-05-07T20:32:25.3977113Z x0 = x[:, :D] 2025-05-07T20:32:25.3977194Z x1 = x[:, D:] 2025-05-07T20:32:25.3977273Z 2025-05-07T20:32:25.3977358Z if contiguous: 2025-05-07T20:32:25.3977450Z x0 = x0.contiguous() 2025-05-07T20:32:25.3977547Z x1 = x1.contiguous() 2025-05-07T20:32:25.3977672Z 2025-05-07T20:32:25.3977769Z if scale_ub is not None: 2025-05-07T20:32:25.3977877Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.3978013Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.3978095Z ) 2025-05-07T20:32:25.3978173Z else: 2025-05-07T20:32:25.3978268Z scale_ub_tensor = None 2025-05-07T20:32:25.3978345Z 2025-05-07T20:32:25.3978474Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3978569Z op = silu_mul_quant 2025-05-07T20:32:25.3978662Z if compiled: 2025-05-07T20:32:25.3978763Z op = torch.compile(op) 2025-05-07T20:32:25.3978870Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3978946Z 2025-05-07T20:32:25.3979037Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.3979042Z 2025-05-07T20:32:25.3979141Z moe/activation_test.py:117: 2025-05-07T20:32:25.3979271Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3979378Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.3979481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3979848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.3979943Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.3980508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.3980611Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.3980974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.3981197Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.3981535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.3981639Z kernel = self.compile( 2025-05-07T20:32:25.3982020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.3982199Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.3982326Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3982331Z 2025-05-07T20:32:25.3982538Z self = 2025-05-07T20:32:25.3983356Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.3983857Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0509844fe0>} 2025-05-07T20:32:25.3984644Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.3984833Z context = 2025-05-07T20:32:25.3984838Z 2025-05-07T20:32:25.3985004Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.3985268Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.3985378Z module_map=module_map) 2025-05-07T20:32:25.3985544Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.3985644Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.3985727Z E ^ 2025-05-07T20:32:25.3986083Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.3986090Z 2025-05-07T20:32:25.3986545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.3986550Z 2025-05-07T20:32:25.3986657Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.3986883Z self=, 2025-05-07T20:32:25.3986961Z T=2048, 2025-05-07T20:32:25.3987045Z D=7168, 2025-05-07T20:32:25.3987132Z scale_ub=1200.0, 2025-05-07T20:32:25.3987224Z contiguous=False, 2025-05-07T20:32:25.3987310Z compiled=True, 2025-05-07T20:32:25.3987388Z ) 2025-05-07T20:32:25.3987609Z self = 2025-05-07T20:32:25.3987787Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:25.3987792Z 2025-05-07T20:32:25.3987871Z @given( 2025-05-07T20:32:25.3987995Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.3988097Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.3988221Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.3988341Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.3988455Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.3988533Z ) 2025-05-07T20:32:25.3988777Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.3988872Z def test_silu_mul_quant( 2025-05-07T20:32:25.3988954Z self, 2025-05-07T20:32:25.3989075Z T: int, 2025-05-07T20:32:25.3989159Z D: int, 2025-05-07T20:32:25.3989262Z scale_ub: Optional[float], 2025-05-07T20:32:25.3989352Z contiguous: bool, 2025-05-07T20:32:25.3989440Z compiled: bool, 2025-05-07T20:32:25.3989523Z ) -> None: 2025-05-07T20:32:25.3989619Z torch.manual_seed(2025) 2025-05-07T20:32:25.3989696Z 2025-05-07T20:32:25.3989870Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.3989951Z 2025-05-07T20:32:25.3990047Z x_sign = torch.sign(x) 2025-05-07T20:32:25.3990173Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.3990262Z x = x_sign * x_clamp 2025-05-07T20:32:25.3990348Z x0 = x[:, :D] 2025-05-07T20:32:25.3990429Z x1 = x[:, D:] 2025-05-07T20:32:25.3990504Z 2025-05-07T20:32:25.3990594Z if contiguous: 2025-05-07T20:32:25.3990686Z x0 = x0.contiguous() 2025-05-07T20:32:25.3990776Z x1 = x1.contiguous() 2025-05-07T20:32:25.3990860Z 2025-05-07T20:32:25.3990998Z if scale_ub is not None: 2025-05-07T20:32:25.3991107Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.3991247Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.3991324Z ) 2025-05-07T20:32:25.3991408Z else: 2025-05-07T20:32:25.3991504Z scale_ub_tensor = None 2025-05-07T20:32:25.3991578Z 2025-05-07T20:32:25.3991712Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.3991845Z op = silu_mul_quant 2025-05-07T20:32:25.3991933Z if compiled: 2025-05-07T20:32:25.3992036Z op = torch.compile(op) 2025-05-07T20:32:25.3992144Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3992219Z 2025-05-07T20:32:25.3992313Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.3992318Z 2025-05-07T20:32:25.3992428Z moe/activation_test.py:117: 2025-05-07T20:32:25.3992581Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3992706Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.3992806Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.3993178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.3993274Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.3993766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.3993917Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.3994274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.3994501Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.3994839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.3994936Z kernel = self.compile( 2025-05-07T20:32:25.3995322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.3995496Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.3995623Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.3995627Z 2025-05-07T20:32:25.3995835Z self = 2025-05-07T20:32:25.3996612Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.3997113Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0509845b20>} 2025-05-07T20:32:25.3997900Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.3998096Z context = 2025-05-07T20:32:25.3998100Z 2025-05-07T20:32:25.3998265Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.3998530Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.3998644Z module_map=module_map) 2025-05-07T20:32:25.3998806Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.3998907Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.3998990Z E ^ 2025-05-07T20:32:25.3999343Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.3999348Z 2025-05-07T20:32:25.3999810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.3999815Z 2025-05-07T20:32:25.3999921Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4000142Z self=, 2025-05-07T20:32:25.4000229Z T=1, 2025-05-07T20:32:25.4000307Z D=5120, 2025-05-07T20:32:25.4000390Z scale_ub=None, 2025-05-07T20:32:25.4000524Z contiguous=False, 2025-05-07T20:32:25.4000610Z compiled=False, 2025-05-07T20:32:25.4000690Z ) 2025-05-07T20:32:25.4000908Z self = 2025-05-07T20:32:25.4001073Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:25.4001078Z 2025-05-07T20:32:25.4001159Z @given( 2025-05-07T20:32:25.4001280Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4001381Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4001503Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4001621Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4001736Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4001816Z ) 2025-05-07T20:32:25.4002059Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4002160Z def test_silu_mul_quant( 2025-05-07T20:32:25.4002283Z self, 2025-05-07T20:32:25.4002362Z T: int, 2025-05-07T20:32:25.4002443Z D: int, 2025-05-07T20:32:25.4002542Z scale_ub: Optional[float], 2025-05-07T20:32:25.4002633Z contiguous: bool, 2025-05-07T20:32:25.4002723Z compiled: bool, 2025-05-07T20:32:25.4002807Z ) -> None: 2025-05-07T20:32:25.4002923Z torch.manual_seed(2025) 2025-05-07T20:32:25.4003007Z 2025-05-07T20:32:25.4003194Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4003272Z 2025-05-07T20:32:25.4003371Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4003495Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4003589Z x = x_sign * x_clamp 2025-05-07T20:32:25.4003672Z x0 = x[:, :D] 2025-05-07T20:32:25.4003754Z x1 = x[:, D:] 2025-05-07T20:32:25.4003830Z 2025-05-07T20:32:25.4003915Z if contiguous: 2025-05-07T20:32:25.4004009Z x0 = x0.contiguous() 2025-05-07T20:32:25.4004108Z x1 = x1.contiguous() 2025-05-07T20:32:25.4004182Z 2025-05-07T20:32:25.4004274Z if scale_ub is not None: 2025-05-07T20:32:25.4004384Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.4004520Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.4004596Z ) 2025-05-07T20:32:25.4004675Z else: 2025-05-07T20:32:25.4004768Z scale_ub_tensor = None 2025-05-07T20:32:25.4004845Z 2025-05-07T20:32:25.4005018Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.4005111Z op = silu_mul_quant 2025-05-07T20:32:25.4005197Z if compiled: 2025-05-07T20:32:25.4005296Z op = torch.compile(op) 2025-05-07T20:32:25.4005401Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4005476Z 2025-05-07T20:32:25.4005566Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.4005571Z 2025-05-07T20:32:25.4005871Z moe/activation_test.py:117: 2025-05-07T20:32:25.4006063Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4006164Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.4006268Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4006762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.4006861Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.4007303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.4007582Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.4007922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.4008022Z kernel = self.compile( 2025-05-07T20:32:25.4008403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.4008645Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.4008773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4008778Z 2025-05-07T20:32:25.4008981Z self = 2025-05-07T20:32:25.4009759Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.4010258Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0509846e80>} 2025-05-07T20:32:25.4011004Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.4011288Z context = 2025-05-07T20:32:25.4011292Z 2025-05-07T20:32:25.4011457Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.4011722Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.4011831Z module_map=module_map) 2025-05-07T20:32:25.4012002Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.4012102Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.4012180Z E ^ 2025-05-07T20:32:25.4012571Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.4012577Z 2025-05-07T20:32:25.4016458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.4016475Z 2025-05-07T20:32:25.4016594Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4016822Z self=, 2025-05-07T20:32:25.4016900Z T=4096, 2025-05-07T20:32:25.4016981Z D=7168, 2025-05-07T20:32:25.4017066Z scale_ub=1200.0, 2025-05-07T20:32:25.4017153Z contiguous=False, 2025-05-07T20:32:25.4017242Z compiled=False, 2025-05-07T20:32:25.4017316Z ) 2025-05-07T20:32:25.4017629Z self = 2025-05-07T20:32:25.4017812Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:25.4017817Z 2025-05-07T20:32:25.4017895Z @given( 2025-05-07T20:32:25.4018028Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4018128Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4018244Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4018370Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4018484Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4018561Z ) 2025-05-07T20:32:25.4018809Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4018906Z def test_silu_mul_quant( 2025-05-07T20:32:25.4018988Z self, 2025-05-07T20:32:25.4019070Z T: int, 2025-05-07T20:32:25.4019149Z D: int, 2025-05-07T20:32:25.4019253Z scale_ub: Optional[float], 2025-05-07T20:32:25.4019391Z contiguous: bool, 2025-05-07T20:32:25.4019480Z compiled: bool, 2025-05-07T20:32:25.4019563Z ) -> None: 2025-05-07T20:32:25.4019659Z torch.manual_seed(2025) 2025-05-07T20:32:25.4019734Z 2025-05-07T20:32:25.4019910Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4019986Z 2025-05-07T20:32:25.4020081Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4020211Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4020348Z x = x_sign * x_clamp 2025-05-07T20:32:25.4020437Z x0 = x[:, :D] 2025-05-07T20:32:25.4020519Z x1 = x[:, D:] 2025-05-07T20:32:25.4020592Z 2025-05-07T20:32:25.4020679Z if contiguous: 2025-05-07T20:32:25.4020771Z x0 = x0.contiguous() 2025-05-07T20:32:25.4020862Z x1 = x1.contiguous() 2025-05-07T20:32:25.4020937Z 2025-05-07T20:32:25.4021029Z if scale_ub is not None: 2025-05-07T20:32:25.4021141Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.4021284Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.4021358Z ) 2025-05-07T20:32:25.4021434Z else: 2025-05-07T20:32:25.4021530Z scale_ub_tensor = None 2025-05-07T20:32:25.4021603Z 2025-05-07T20:32:25.4021733Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.4021824Z op = silu_mul_quant 2025-05-07T20:32:25.4021956Z if compiled: 2025-05-07T20:32:25.4022060Z op = torch.compile(op) 2025-05-07T20:32:25.4022166Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4022238Z 2025-05-07T20:32:25.4022331Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.4022336Z 2025-05-07T20:32:25.4022432Z moe/activation_test.py:117: 2025-05-07T20:32:25.4022560Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4022664Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.4022769Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4023273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:25.4023372Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.4023733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.4023952Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.4024295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.4024390Z kernel = self.compile( 2025-05-07T20:32:25.4024770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.4024943Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.4025113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4025118Z 2025-05-07T20:32:25.4025319Z self = 2025-05-07T20:32:25.4026090Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.4026589Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0509424040>} 2025-05-07T20:32:25.4027333Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.4027519Z context = 2025-05-07T20:32:25.4027527Z 2025-05-07T20:32:25.4027759Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.4028027Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.4028133Z module_map=module_map) 2025-05-07T20:32:25.4028298Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.4028395Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.4028512Z E ^ 2025-05-07T20:32:25.4028867Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.4028872Z 2025-05-07T20:32:25.4029282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.4029286Z 2025-05-07T20:32:25.4029394Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4029613Z self=, 2025-05-07T20:32:25.4029694Z T=16384, 2025-05-07T20:32:25.4029775Z D=7168, 2025-05-07T20:32:25.4029856Z scale_ub=None, 2025-05-07T20:32:25.4029937Z contiguous=True, 2025-05-07T20:32:25.4030022Z compiled=True, 2025-05-07T20:32:25.4030093Z ) 2025-05-07T20:32:25.4030306Z self = 2025-05-07T20:32:25.4030480Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.4030528Z 2025-05-07T20:32:25.4030607Z @given( 2025-05-07T20:32:25.4030726Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4030823Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4030937Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4031055Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4031169Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4031244Z ) 2025-05-07T20:32:25.4031494Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4031587Z def test_silu_mul_quant( 2025-05-07T20:32:25.4031663Z self, 2025-05-07T20:32:25.4031742Z T: int, 2025-05-07T20:32:25.4031819Z D: int, 2025-05-07T20:32:25.4031916Z scale_ub: Optional[float], 2025-05-07T20:32:25.4032007Z contiguous: bool, 2025-05-07T20:32:25.4032090Z compiled: bool, 2025-05-07T20:32:25.4032173Z ) -> None: 2025-05-07T20:32:25.4032268Z torch.manual_seed(2025) 2025-05-07T20:32:25.4032340Z 2025-05-07T20:32:25.4032510Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4032583Z 2025-05-07T20:32:25.4032675Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4032800Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4032888Z x = x_sign * x_clamp 2025-05-07T20:32:25.4032968Z x0 = x[:, :D] 2025-05-07T20:32:25.4033051Z x1 = x[:, D:] 2025-05-07T20:32:25.4033169Z 2025-05-07T20:32:25.4033256Z if contiguous: 2025-05-07T20:32:25.4033353Z x0 = x0.contiguous() 2025-05-07T20:32:25.4033444Z x1 = x1.contiguous() 2025-05-07T20:32:25.4033518Z 2025-05-07T20:32:25.4033609Z if scale_ub is not None: 2025-05-07T20:32:25.4033715Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.4033852Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.4033933Z ) 2025-05-07T20:32:25.4034008Z else: 2025-05-07T20:32:25.4034104Z scale_ub_tensor = None 2025-05-07T20:32:25.4034177Z 2025-05-07T20:32:25.4034305Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.4034397Z op = silu_mul_quant 2025-05-07T20:32:25.4034482Z if compiled: 2025-05-07T20:32:25.4034580Z op = torch.compile(op) 2025-05-07T20:32:25.4034685Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4034765Z 2025-05-07T20:32:25.4034896Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.4034905Z 2025-05-07T20:32:25.4035002Z moe/activation_test.py:117: 2025-05-07T20:32:25.4035130Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4035235Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.4035333Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4035698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.4035835Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = ...
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': ..., 'min_dot_size': ...}
module_map = {'triton.language.extra.libdevice': ...}
context = ...

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=...,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = ...
options = CUDAOptions(..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': ..., 'min_dot_size': ...}
module_map = {'triton.language.extra.libdevice': ...}
context = ...

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
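For context on what the test drives: silu_mul_quant fuses a SiLU-gated elementwise multiply with quantization to FP8. The sketch below is only a rough eager-mode illustration of that computation, not fbgemm_gpu's implementation; the row-wise dynamic scaling scheme, the torch.float8_e4m3fn output dtype, and scale_ub acting as a cap on the per-row maximum are all assumptions made for illustration.

# Rough eager-mode sketch of the computation under test -- NOT
# fbgemm_gpu's implementation. Assumptions: row-wise dynamic scaling,
# torch.float8_e4m3fn output (max finite value 448.0), and scale_ub
# bounding the per-row maximum before the scale is derived.
from typing import Optional, Tuple

import torch

FP8_E4M3_MAX = 448.0  # largest finite torch.float8_e4m3fn value

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU-gated multiply, computed in float32 for accuracy.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # The per-row absolute maximum drives the dynamic FP8 scale.
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max / FP8_E4M3_MAX
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale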
Hypothesis then tries ten more examples; every one of them fails inside _fbgemm_silu_mul_quant with the identical CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"):

Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
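The repeated failure is an architecture gate in Triton rather than a data-dependent bug: fp8e4nv (FP8 E4M3) code generation requires a GPU with compute capability 8.9 or newer, while the A10G behind a linux.g5.4xlarge runner reports 8.6, where Triton only exposes fp8e4b15 and fp8e5. A minimal guard sketch follows; supports_fp8e4nv is a hypothetical helper, and the (8, 9) threshold is an assumption inferred from this error, not an existing FBGEMM API.

# Hedged sketch of a compute-capability guard. supports_fp8e4nv is a
# hypothetical helper; the (8, 9) minimum for Triton's fp8e4nv is an
# assumption inferred from the error above (this A10G reports (8, 6)).
import unittest

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (FP8 E4M3) kernels need SM 8.9+ (Ada/Hopper-class GPUs).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

class GuardedFP8Test(unittest.TestCase):
    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9 or newer")
    def test_fp8_kernel(self) -> None:
        ...  # fp8e4nv-dependent kernel launches would go here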
Hypothesis then tried the remaining sampled combinations. Every one failed at the same point, triton/compiler/compiler.py:100: CompilationError with ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); the only variation is that the compiled=False runs have no torch/_dynamo/eval_frame.py frame, which confirms the error comes from the Triton kernel launch itself rather than from torch.compile:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False) -> CompilationError
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.4477331Z 2025-05-07T20:32:25.4477748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.4477752Z 2025-05-07T20:32:25.4477865Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4478090Z self=, 2025-05-07T20:32:25.4478171Z T=16384, 2025-05-07T20:32:25.4478257Z D=5120, 2025-05-07T20:32:25.4478349Z scale_ub=None, 2025-05-07T20:32:25.4478437Z contiguous=False, 2025-05-07T20:32:25.4478528Z compiled=False, 2025-05-07T20:32:25.4478606Z ) 2025-05-07T20:32:25.4478828Z self = 2025-05-07T20:32:25.4479014Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:25.4479018Z 2025-05-07T20:32:25.4479097Z @given( 2025-05-07T20:32:25.4479231Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4479332Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4479449Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4479575Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4479690Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4479770Z ) 2025-05-07T20:32:25.4480022Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4480163Z def test_silu_mul_quant( 2025-05-07T20:32:25.4480245Z self, 2025-05-07T20:32:25.4480333Z T: int, 2025-05-07T20:32:25.4480414Z D: int, 2025-05-07T20:32:25.4480518Z scale_ub: Optional[float], 2025-05-07T20:32:25.4480612Z contiguous: bool, 2025-05-07T20:32:25.4480704Z compiled: bool, 2025-05-07T20:32:25.4480796Z ) -> None: 2025-05-07T20:32:25.4480895Z torch.manual_seed(2025) 2025-05-07T20:32:25.4480977Z 2025-05-07T20:32:25.4481158Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4481240Z 2025-05-07T20:32:25.4481335Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4481468Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4483367Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
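The CompilationError above repeats for every Hypothesis example that actually reaches the Triton kernel: fp8e4nv is Triton's FP8 E4M3 dtype, and the supported list in the ValueError, ('fp8e4b15', 'fp8e5'), is what Triton offers on GPUs older than compute capability 8.9 (Ada/Hopper). A minimal sketch of a capability guard that would skip these tests on such hardware follows; the helper and class names are illustrative, not taken from activation_test.py.

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (FP8 E4M3) needs an Ada (SM 8.9) or Hopper (SM 9.0)
        # GPU; older parts only expose fp8e4b15 / fp8e5, which matches the
        # ValueError in the traceback above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical guard; the real test class name may differ.
    @unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 unsupported on this GPU")
    class SiluMulQuantTests(unittest.TestCase):
        ...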
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4483374Z 2025-05-07T20:32:25.4483507Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:25.4483551Z 2025-05-07T20:32:25.4483659Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4483890Z self=, 2025-05-07T20:32:25.4483970Z T=4096, 2025-05-07T20:32:25.4484054Z D=7168, 2025-05-07T20:32:25.4484151Z scale_ub=1200.0, 2025-05-07T20:32:25.4484238Z contiguous=True, 2025-05-07T20:32:25.4484325Z compiled=True, 2025-05-07T20:32:25.4484411Z ) 2025-05-07T20:32:25.4484637Z self = 2025-05-07T20:32:25.4484812Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.4484817Z 2025-05-07T20:32:25.4484903Z @given( 2025-05-07T20:32:25.4485025Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4485132Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4485250Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4485370Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4485532Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4485609Z ) 2025-05-07T20:32:25.4485859Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4485964Z def test_silu_mul_quant( 2025-05-07T20:32:25.4486047Z self, 2025-05-07T20:32:25.4486127Z T: int, 2025-05-07T20:32:25.4486215Z D: int, 2025-05-07T20:32:25.4486317Z scale_ub: Optional[float], 2025-05-07T20:32:25.4486413Z contiguous: bool, 2025-05-07T20:32:25.4486510Z compiled: bool, 2025-05-07T20:32:25.4486594Z ) -> None: 2025-05-07T20:32:25.4486699Z torch.manual_seed(2025) 2025-05-07T20:32:25.4486777Z 2025-05-07T20:32:25.4486947Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4487034Z 2025-05-07T20:32:25.4487130Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4487259Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4489154Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4489160Z 2025-05-07T20:32:25.4489285Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:25.4489289Z 2025-05-07T20:32:25.4489402Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4489626Z self=, 2025-05-07T20:32:25.4489706Z T=16384, 2025-05-07T20:32:25.4489790Z D=7168, 2025-05-07T20:32:25.4489879Z scale_ub=None, 2025-05-07T20:32:25.4489977Z contiguous=False, 2025-05-07T20:32:25.4490063Z compiled=False, 2025-05-07T20:32:25.4490140Z ) 2025-05-07T20:32:25.4490361Z self = 2025-05-07T20:32:25.4490539Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:25.4490544Z 2025-05-07T20:32:25.4490624Z @given( 2025-05-07T20:32:25.4490750Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4490900Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4491019Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4491149Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4491267Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4491354Z ) 2025-05-07T20:32:25.4491600Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4491700Z def test_silu_mul_quant( 2025-05-07T20:32:25.4491825Z self, 2025-05-07T20:32:25.4491904Z T: int, 2025-05-07T20:32:25.4491983Z D: int, 2025-05-07T20:32:25.4492093Z scale_ub: Optional[float], 2025-05-07T20:32:25.4492186Z contiguous: bool, 2025-05-07T20:32:25.4492274Z compiled: bool, 2025-05-07T20:32:25.4492363Z ) -> None: 2025-05-07T20:32:25.4492458Z torch.manual_seed(2025) 2025-05-07T20:32:25.4492534Z 2025-05-07T20:32:25.4492708Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4494497Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4494546Z 2025-05-07T20:32:25.4494673Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4494678Z 2025-05-07T20:32:25.4494782Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4495008Z self=, 2025-05-07T20:32:25.4495089Z T=2048, 2025-05-07T20:32:25.4495169Z D=7168, 2025-05-07T20:32:25.4495264Z scale_ub=1200.0, 2025-05-07T20:32:25.4495357Z contiguous=True, 2025-05-07T20:32:25.4495444Z compiled=True, 2025-05-07T20:32:25.4495529Z ) 2025-05-07T20:32:25.4495745Z self = 2025-05-07T20:32:25.4495917Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.4495929Z 2025-05-07T20:32:25.4496009Z @given( 2025-05-07T20:32:25.4496131Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4496243Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4496362Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4496480Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4496602Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4496682Z ) 2025-05-07T20:32:25.4496926Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4497072Z def test_silu_mul_quant( 2025-05-07T20:32:25.4497155Z self, 2025-05-07T20:32:25.4497235Z T: int, 2025-05-07T20:32:25.4497324Z D: int, 2025-05-07T20:32:25.4497425Z scale_ub: Optional[float], 2025-05-07T20:32:25.4497517Z contiguous: bool, 2025-05-07T20:32:25.4497610Z compiled: bool, 2025-05-07T20:32:25.4497692Z ) -> None: 2025-05-07T20:32:25.4497789Z torch.manual_seed(2025) 2025-05-07T20:32:25.4497872Z 2025-05-07T20:32:25.4498046Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4498124Z 2025-05-07T20:32:25.4498227Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4498354Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4500193Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4500200Z 2025-05-07T20:32:25.4500320Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:25.4500325Z 2025-05-07T20:32:25.4500477Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4500699Z self=, 2025-05-07T20:32:25.4500782Z T=2048, 2025-05-07T20:32:25.4500870Z D=7168, 2025-05-07T20:32:25.4500955Z scale_ub=None, 2025-05-07T20:32:25.4501045Z contiguous=True, 2025-05-07T20:32:25.4501136Z compiled=False, 2025-05-07T20:32:25.4501212Z ) 2025-05-07T20:32:25.4501428Z self = 2025-05-07T20:32:25.4501610Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.4501614Z 2025-05-07T20:32:25.4501693Z @given( 2025-05-07T20:32:25.4501819Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4501920Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4502037Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4502160Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4502320Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4502397Z ) 2025-05-07T20:32:25.4502677Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4502792Z def test_silu_mul_quant( 2025-05-07T20:32:25.4502871Z self, 2025-05-07T20:32:25.4502958Z T: int, 2025-05-07T20:32:25.4503037Z D: int, 2025-05-07T20:32:25.4503143Z scale_ub: Optional[float], 2025-05-07T20:32:25.4503234Z contiguous: bool, 2025-05-07T20:32:25.4503325Z compiled: bool, 2025-05-07T20:32:25.4503417Z ) -> None: 2025-05-07T20:32:25.4503514Z torch.manual_seed(2025) 2025-05-07T20:32:25.4503591Z 2025-05-07T20:32:25.4503765Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4503840Z 2025-05-07T20:32:25.4503935Z > x_sign = torch.sign(x) 2025-05-07T20:32:25.4505945Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
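The OutOfMemoryError is the second failure mode in this run, and the reports show PyTorch's allocated memory creeping upward across successive examples (21.60 GiB, then 21.61 GiB, then 21.67 GiB), which suggests tensors from earlier Hypothesis examples are still alive when the next one allocates. A minimal sketch of an explicit release between examples, under that assumption and with a hypothetical function name, could look like this:

    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()               # drop dead Python references first
        torch.cuda.empty_cache()   # hand cached blocks back to the driver
        torch.cuda.synchronize()   # ensure pending frees have completed

    # e.g. call release_cuda_memory() at the top of test_silu_mul_quant, or
    # from a setUp/tearDown hook, so each drawn example starts from a clean
    # allocator state.

Note that the log's own suggestion, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, only mitigates fragmentation; it would not help if the live working set itself has outgrown the 22.07 GiB card.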
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4505964Z 2025-05-07T20:32:25.4506236Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:25.4506242Z 2025-05-07T20:32:25.4506354Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4506577Z self=, 2025-05-07T20:32:25.4506664Z T=1, 2025-05-07T20:32:25.4506743Z D=7168, 2025-05-07T20:32:25.4506826Z scale_ub=1200.0, 2025-05-07T20:32:25.4506914Z contiguous=True, 2025-05-07T20:32:25.4506999Z compiled=False, 2025-05-07T20:32:25.4507077Z ) 2025-05-07T20:32:25.4507294Z self = 2025-05-07T20:32:25.4507456Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.4507461Z 2025-05-07T20:32:25.4507537Z @given( 2025-05-07T20:32:25.4507657Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4507756Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4507876Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4508056Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4508170Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4508256Z ) 2025-05-07T20:32:25.4508495Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4508587Z def test_silu_mul_quant( 2025-05-07T20:32:25.4508669Z self, 2025-05-07T20:32:25.4508745Z T: int, 2025-05-07T20:32:25.4508880Z D: int, 2025-05-07T20:32:25.4508986Z scale_ub: Optional[float], 2025-05-07T20:32:25.4509076Z contiguous: bool, 2025-05-07T20:32:25.4509162Z compiled: bool, 2025-05-07T20:32:25.4509246Z ) -> None: 2025-05-07T20:32:25.4509340Z torch.manual_seed(2025) 2025-05-07T20:32:25.4509414Z 2025-05-07T20:32:25.4509586Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4509660Z 2025-05-07T20:32:25.4509755Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4509883Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4509972Z x = x_sign * x_clamp 2025-05-07T20:32:25.4510059Z x0 = x[:, :D] 2025-05-07T20:32:25.4510142Z x1 = x[:, D:] 2025-05-07T20:32:25.4510215Z 2025-05-07T20:32:25.4510305Z if contiguous: 2025-05-07T20:32:25.4510401Z x0 = x0.contiguous() 2025-05-07T20:32:25.4510495Z x1 = x1.contiguous() 2025-05-07T20:32:25.4510580Z 2025-05-07T20:32:25.4510743Z if scale_ub is not None: 2025-05-07T20:32:25.4510850Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.4510989Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.4511066Z ) 2025-05-07T20:32:25.4511147Z else: 2025-05-07T20:32:25.4511240Z scale_ub_tensor = None 2025-05-07T20:32:25.4511314Z 2025-05-07T20:32:25.4511455Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.4511548Z op = silu_mul_quant 2025-05-07T20:32:25.4511636Z if compiled: 2025-05-07T20:32:25.4511744Z op = torch.compile(op) 2025-05-07T20:32:25.4511850Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4511922Z 2025-05-07T20:32:25.4512018Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.4512023Z 2025-05-07T20:32:25.4512117Z moe/activation_test.py:117: 2025-05-07T20:32:25.4512253Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4512361Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.4512461Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4513018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.4513115Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.4513471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.4513745Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.4514087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.4514189Z kernel = self.compile( 2025-05-07T20:32:25.4514569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.4514741Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.4514879Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4514883Z 2025-05-07T20:32:25.4515083Z self = 2025-05-07T20:32:25.4515861Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.4516397Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f050831c680>} 2025-05-07T20:32:25.4517139Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.4517370Z context = 2025-05-07T20:32:25.4517374Z 2025-05-07T20:32:25.4517537Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.4517804Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.4517911Z module_map=module_map) 2025-05-07T20:32:25.4518072Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.4518180Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.4518260Z E ^ 2025-05-07T20:32:25.4518611Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.4518622Z 2025-05-07T20:32:25.4519037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.4519042Z 2025-05-07T20:32:25.4519145Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4519419Z self=, 2025-05-07T20:32:25.4519496Z T=128, 2025-05-07T20:32:25.4519576Z D=5120, 2025-05-07T20:32:25.4519663Z scale_ub=None, 2025-05-07T20:32:25.4519749Z contiguous=True, 2025-05-07T20:32:25.4519832Z compiled=False, 2025-05-07T20:32:25.4519913Z ) 2025-05-07T20:32:25.4520128Z self = 2025-05-07T20:32:25.4520306Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.4520313Z 2025-05-07T20:32:25.4520391Z @given( 2025-05-07T20:32:25.4520510Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4520617Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4520733Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4520847Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4520966Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4521049Z ) 2025-05-07T20:32:25.4521292Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4521393Z def test_silu_mul_quant( 2025-05-07T20:32:25.4521469Z self, 2025-05-07T20:32:25.4521552Z T: int, 2025-05-07T20:32:25.4521627Z D: int, 2025-05-07T20:32:25.4521725Z scale_ub: Optional[float], 2025-05-07T20:32:25.4521820Z contiguous: bool, 2025-05-07T20:32:25.4521905Z compiled: bool, 2025-05-07T20:32:25.4522029Z ) -> None: 2025-05-07T20:32:25.4522133Z torch.manual_seed(2025) 2025-05-07T20:32:25.4522206Z 2025-05-07T20:32:25.4522374Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4522454Z 2025-05-07T20:32:25.4522547Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4522671Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4522765Z x = x_sign * x_clamp 2025-05-07T20:32:25.4522850Z x0 = x[:, :D] 2025-05-07T20:32:25.4522938Z x1 = x[:, D:] 2025-05-07T20:32:25.4523012Z 2025-05-07T20:32:25.4523095Z if contiguous: 2025-05-07T20:32:25.4523194Z x0 = x0.contiguous() 2025-05-07T20:32:25.4523283Z x1 = x1.contiguous() 2025-05-07T20:32:25.4523357Z 2025-05-07T20:32:25.4523453Z if scale_ub is not None: 2025-05-07T20:32:25.4523557Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.4523697Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.4523820Z ) 2025-05-07T20:32:25.4523900Z else: 2025-05-07T20:32:25.4523996Z scale_ub_tensor = None 2025-05-07T20:32:25.4524075Z 2025-05-07T20:32:25.4524207Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.4524305Z op = silu_mul_quant 2025-05-07T20:32:25.4524389Z if compiled: 2025-05-07T20:32:25.4524491Z op = torch.compile(op) 2025-05-07T20:32:25.4524643Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4524716Z 2025-05-07T20:32:25.4524806Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.4524810Z 2025-05-07T20:32:25.4524915Z moe/activation_test.py:117: 2025-05-07T20:32:25.4525046Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4525145Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.4525252Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4525751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.4525856Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.4526215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.4526436Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.4526779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.4526917Z kernel = self.compile( 2025-05-07T20:32:25.4527296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.4527473Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.4527670Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4527675Z 2025-05-07T20:32:25.4527893Z self = 2025-05-07T20:32:25.4528669Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.4529176Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f050831d8a0>} 2025-05-07T20:32:25.4529928Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.4530120Z context = 2025-05-07T20:32:25.4530125Z 2025-05-07T20:32:25.4530368Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.4530637Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.4530756Z module_map=module_map) 2025-05-07T20:32:25.4530922Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.4531025Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.4531112Z E ^ 2025-05-07T20:32:25.4531466Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.4531476Z 2025-05-07T20:32:25.4531888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.4531900Z 2025-05-07T20:32:25.4532007Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4532231Z self=, 2025-05-07T20:32:25.4532315Z T=128, 2025-05-07T20:32:25.4532395Z D=7168, 2025-05-07T20:32:25.4532521Z scale_ub=None, 2025-05-07T20:32:25.4532617Z contiguous=True, 2025-05-07T20:32:25.4532711Z compiled=False, 2025-05-07T20:32:25.4532789Z ) 2025-05-07T20:32:25.4533012Z self = 2025-05-07T20:32:25.4533188Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.4533193Z 2025-05-07T20:32:25.4533276Z @given( 2025-05-07T20:32:25.4533444Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4533549Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4533672Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4533792Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4533906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4533993Z ) 2025-05-07T20:32:25.4534239Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4534336Z def test_silu_mul_quant( 2025-05-07T20:32:25.4534425Z self, 2025-05-07T20:32:25.4534507Z T: int, 2025-05-07T20:32:25.4534587Z D: int, 2025-05-07T20:32:25.4534694Z scale_ub: Optional[float], 2025-05-07T20:32:25.4534787Z contiguous: bool, 2025-05-07T20:32:25.4534883Z compiled: bool, 2025-05-07T20:32:25.4534966Z ) -> None: 2025-05-07T20:32:25.4535062Z torch.manual_seed(2025) 2025-05-07T20:32:25.4535146Z 2025-05-07T20:32:25.4535363Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4535441Z 2025-05-07T20:32:25.4535542Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4535669Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4535761Z x = x_sign * x_clamp 2025-05-07T20:32:25.4535851Z x0 = x[:, :D] 2025-05-07T20:32:25.4535934Z x1 = x[:, D:] 2025-05-07T20:32:25.4536010Z 2025-05-07T20:32:25.4536101Z if contiguous: 2025-05-07T20:32:25.4536198Z x0 = x0.contiguous() 2025-05-07T20:32:25.4536293Z x1 = x1.contiguous() 2025-05-07T20:32:25.4536374Z 2025-05-07T20:32:25.4536467Z if scale_ub is not None: 2025-05-07T20:32:25.4536581Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.4536720Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.4536799Z ) 2025-05-07T20:32:25.4536883Z else: 2025-05-07T20:32:25.4536985Z scale_ub_tensor = None 2025-05-07T20:32:25.4537069Z 2025-05-07T20:32:25.4537206Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.4537299Z op = silu_mul_quant 2025-05-07T20:32:25.4537388Z if compiled: 2025-05-07T20:32:25.4537496Z op = torch.compile(op) 2025-05-07T20:32:25.4537609Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4537686Z 2025-05-07T20:32:25.4537786Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.4537791Z 2025-05-07T20:32:25.4537957Z moe/activation_test.py:117: 2025-05-07T20:32:25.4538098Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4538205Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.4538306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4538810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.4538917Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.4539274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.4539500Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.4539840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.4539943Z kernel = self.compile( 2025-05-07T20:32:25.4540366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.4540544Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.4540680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4540685Z 2025-05-07T20:32:25.4540892Z self = 2025-05-07T20:32:25.4541671Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.4542218Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f050831e7a0>} 2025-05-07T20:32:25.4542968Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.4543165Z context = 2025-05-07T20:32:25.4543170Z 2025-05-07T20:32:25.4543336Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.4543612Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.4543768Z module_map=module_map) 2025-05-07T20:32:25.4543931Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.4544041Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.4544124Z E ^ 2025-05-07T20:32:25.4544485Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.4544490Z 2025-05-07T20:32:25.4544909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.4544913Z 2025-05-07T20:32:25.4545021Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4545252Z self=, 2025-05-07T20:32:25.4545334Z T=2048, 2025-05-07T20:32:25.4545419Z D=7168, 2025-05-07T20:32:25.4545512Z scale_ub=1200.0, 2025-05-07T20:32:25.4545600Z contiguous=True, 2025-05-07T20:32:25.4545692Z compiled=False, 2025-05-07T20:32:25.4545775Z ) 2025-05-07T20:32:25.4545995Z self = 2025-05-07T20:32:25.4546177Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.4546181Z 2025-05-07T20:32:25.4546261Z @given( 2025-05-07T20:32:25.4546384Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4546492Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4546653Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4546780Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4546901Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4546979Z ) 2025-05-07T20:32:25.4547232Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4547330Z def test_silu_mul_quant( 2025-05-07T20:32:25.4547410Z self, 2025-05-07T20:32:25.4547497Z T: int, 2025-05-07T20:32:25.4547581Z D: int, 2025-05-07T20:32:25.4547682Z scale_ub: Optional[float], 2025-05-07T20:32:25.4547784Z contiguous: bool, 2025-05-07T20:32:25.4547871Z compiled: bool, 2025-05-07T20:32:25.4547955Z ) -> None: 2025-05-07T20:32:25.4548060Z torch.manual_seed(2025) 2025-05-07T20:32:25.4548138Z 2025-05-07T20:32:25.4548317Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4550135Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4550179Z 2025-05-07T20:32:25.4550310Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4550314Z 2025-05-07T20:32:25.4550421Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4550650Z self=, 2025-05-07T20:32:25.4550729Z T=1, 2025-05-07T20:32:25.4550812Z D=5120, 2025-05-07T20:32:25.4550906Z scale_ub=1200.0, 2025-05-07T20:32:25.4550996Z contiguous=True, 2025-05-07T20:32:25.4551087Z compiled=False, 2025-05-07T20:32:25.4551173Z ) 2025-05-07T20:32:25.4551392Z self = 2025-05-07T20:32:25.4551558Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.4551563Z 2025-05-07T20:32:25.4551651Z @given( 2025-05-07T20:32:25.4551772Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4551880Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4552042Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4552161Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4552285Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4552362Z ) 2025-05-07T20:32:25.4552609Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4552715Z def test_silu_mul_quant( 2025-05-07T20:32:25.4552794Z self, 2025-05-07T20:32:25.4552875Z T: int, 2025-05-07T20:32:25.4552963Z D: int, 2025-05-07T20:32:25.4553068Z scale_ub: Optional[float], 2025-05-07T20:32:25.4553164Z contiguous: bool, 2025-05-07T20:32:25.4553253Z compiled: bool, 2025-05-07T20:32:25.4553335Z ) -> None: 2025-05-07T20:32:25.4553439Z torch.manual_seed(2025) 2025-05-07T20:32:25.4553516Z 2025-05-07T20:32:25.4553685Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4553772Z 2025-05-07T20:32:25.4553872Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4554002Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4554094Z x = x_sign * x_clamp 2025-05-07T20:32:25.4554179Z x0 = x[:, :D] 2025-05-07T20:32:25.4554263Z x1 = x[:, D:] 2025-05-07T20:32:25.4554352Z 2025-05-07T20:32:25.4554441Z if contiguous: 2025-05-07T20:32:25.4554537Z x0 = x0.contiguous() 2025-05-07T20:32:25.4554635Z x1 = x1.contiguous() 2025-05-07T20:32:25.4554714Z 2025-05-07T20:32:25.4554856Z if scale_ub is not None: 2025-05-07T20:32:25.4554974Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.4555114Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.4555199Z ) 2025-05-07T20:32:25.4555278Z else: 2025-05-07T20:32:25.4555377Z scale_ub_tensor = None 2025-05-07T20:32:25.4555462Z 2025-05-07T20:32:25.4555593Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.4555691Z op = silu_mul_quant 2025-05-07T20:32:25.4555785Z if compiled: 2025-05-07T20:32:25.4555889Z op = torch.compile(op) 2025-05-07T20:32:25.4555998Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4556081Z 2025-05-07T20:32:25.4556175Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.4556180Z 2025-05-07T20:32:25.4556291Z moe/activation_test.py:117: 2025-05-07T20:32:25.4556433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4556649Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.4556768Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4557269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.4557372Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.4557736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.4558002Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.4558350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.4558449Z kernel = self.compile( 2025-05-07T20:32:25.4558831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.4559021Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.4559154Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4559159Z 2025-05-07T20:32:25.4559365Z self = 2025-05-07T20:32:25.4560150Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.4560723Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f050831fb00>} 2025-05-07T20:32:25.4566731Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.4566956Z context = 2025-05-07T20:32:25.4566962Z 2025-05-07T20:32:25.4567130Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.4567398Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.4567508Z module_map=module_map) 2025-05-07T20:32:25.4567755Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.4567864Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.4567942Z E ^ 2025-05-07T20:32:25.4568298Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.4568303Z 2025-05-07T20:32:25.4568720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.4568725Z 2025-05-07T20:32:25.4568899Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4569127Z self=, 2025-05-07T20:32:25.4569210Z T=2048, 2025-05-07T20:32:25.4569294Z D=5120, 2025-05-07T20:32:25.4569379Z scale_ub=None, 2025-05-07T20:32:25.4569468Z contiguous=True, 2025-05-07T20:32:25.4569560Z compiled=False, 2025-05-07T20:32:25.4569635Z ) 2025-05-07T20:32:25.4569855Z self = 2025-05-07T20:32:25.4570041Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.4570046Z 2025-05-07T20:32:25.4570124Z @given( 2025-05-07T20:32:25.4570256Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4570359Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4570476Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4570601Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4570720Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4570840Z ) 2025-05-07T20:32:25.4571093Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4571188Z def test_silu_mul_quant( 2025-05-07T20:32:25.4571266Z self, 2025-05-07T20:32:25.4571349Z T: int, 2025-05-07T20:32:25.4571428Z D: int, 2025-05-07T20:32:25.4571532Z scale_ub: Optional[float], 2025-05-07T20:32:25.4571628Z contiguous: bool, 2025-05-07T20:32:25.4571757Z compiled: bool, 2025-05-07T20:32:25.4571841Z ) -> None: 2025-05-07T20:32:25.4571938Z torch.manual_seed(2025) 2025-05-07T20:32:25.4572012Z 2025-05-07T20:32:25.4572186Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4572261Z 2025-05-07T20:32:25.4572359Z > x_sign = torch.sign(x) 2025-05-07T20:32:25.4574153Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4574201Z 2025-05-07T20:32:25.4574324Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:25.4574329Z 2025-05-07T20:32:25.4574436Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4574657Z self=, 2025-05-07T20:32:25.4574740Z T=16384, 2025-05-07T20:32:25.4574819Z D=5120, 2025-05-07T20:32:25.4574904Z scale_ub=None, 2025-05-07T20:32:25.4574993Z contiguous=True, 2025-05-07T20:32:25.4575079Z compiled=False, 2025-05-07T20:32:25.4575158Z ) 2025-05-07T20:32:25.4575383Z self = 2025-05-07T20:32:25.4575560Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.4575565Z 2025-05-07T20:32:25.4575643Z @given( 2025-05-07T20:32:25.4575767Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4575867Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4575985Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4576110Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4576224Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4576302Z ) 2025-05-07T20:32:25.4576548Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4576648Z def test_silu_mul_quant( 2025-05-07T20:32:25.4576726Z self, 2025-05-07T20:32:25.4576803Z T: int, 2025-05-07T20:32:25.4576883Z D: int, 2025-05-07T20:32:25.4577027Z scale_ub: Optional[float], 2025-05-07T20:32:25.4577118Z contiguous: bool, 2025-05-07T20:32:25.4577207Z compiled: bool, 2025-05-07T20:32:25.4577286Z ) -> None: 2025-05-07T20:32:25.4577382Z torch.manual_seed(2025) 2025-05-07T20:32:25.4577463Z 2025-05-07T20:32:25.4577629Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4579403Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4579417Z 2025-05-07T20:32:25.4579574Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4579579Z 2025-05-07T20:32:25.4579694Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4579915Z self=, 2025-05-07T20:32:25.4579994Z T=4096, 2025-05-07T20:32:25.4580075Z D=5120, 2025-05-07T20:32:25.4580158Z scale_ub=None, 2025-05-07T20:32:25.4580244Z contiguous=True, 2025-05-07T20:32:25.4580376Z compiled=False, 2025-05-07T20:32:25.4580452Z ) 2025-05-07T20:32:25.4580666Z self = 2025-05-07T20:32:25.4580839Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.4580844Z 2025-05-07T20:32:25.4580924Z @given( 2025-05-07T20:32:25.4581046Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4581145Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4581267Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4581390Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4581504Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4581580Z ) 2025-05-07T20:32:25.4581826Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4581920Z def test_silu_mul_quant( 2025-05-07T20:32:25.4581997Z self, 2025-05-07T20:32:25.4582125Z T: int, 2025-05-07T20:32:25.4582203Z D: int, 2025-05-07T20:32:25.4582308Z scale_ub: Optional[float], 2025-05-07T20:32:25.4582400Z contiguous: bool, 2025-05-07T20:32:25.4582488Z compiled: bool, 2025-05-07T20:32:25.4582570Z ) -> None: 2025-05-07T20:32:25.4582665Z torch.manual_seed(2025) 2025-05-07T20:32:25.4582741Z 2025-05-07T20:32:25.4582914Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4584678Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4584688Z 2025-05-07T20:32:25.4584812Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4584816Z 2025-05-07T20:32:25.4584920Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4585140Z self=, 2025-05-07T20:32:25.4585220Z T=2048, 2025-05-07T20:32:25.4585297Z D=5120, 2025-05-07T20:32:25.4585386Z scale_ub=None, 2025-05-07T20:32:25.4585519Z contiguous=False, 2025-05-07T20:32:25.4585608Z compiled=False, 2025-05-07T20:32:25.4585689Z ) 2025-05-07T20:32:25.4585906Z self = 2025-05-07T20:32:25.4586077Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:25.4586082Z 2025-05-07T20:32:25.4586168Z @given( 2025-05-07T20:32:25.4586292Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4586397Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4586522Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4586642Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4586759Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4586834Z ) 2025-05-07T20:32:25.4587080Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4587184Z def test_silu_mul_quant( 2025-05-07T20:32:25.4587263Z self, 2025-05-07T20:32:25.4587344Z T: int, 2025-05-07T20:32:25.4587470Z D: int, 2025-05-07T20:32:25.4587573Z scale_ub: Optional[float], 2025-05-07T20:32:25.4587664Z contiguous: bool, 2025-05-07T20:32:25.4587757Z compiled: bool, 2025-05-07T20:32:25.4587836Z ) -> None: 2025-05-07T20:32:25.4587932Z torch.manual_seed(2025) 2025-05-07T20:32:25.4588010Z 2025-05-07T20:32:25.4588178Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4589986Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4589992Z 2025-05-07T20:32:25.4590113Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4590118Z 2025-05-07T20:32:25.4590228Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4590448Z self=, 2025-05-07T20:32:25.4590526Z T=4096, 2025-05-07T20:32:25.4590605Z D=7168, 2025-05-07T20:32:25.4590731Z scale_ub=None, 2025-05-07T20:32:25.4590820Z contiguous=True, 2025-05-07T20:32:25.4590909Z compiled=True, 2025-05-07T20:32:25.4590986Z ) 2025-05-07T20:32:25.4591203Z self = 2025-05-07T20:32:25.4591375Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.4591381Z 2025-05-07T20:32:25.4591459Z @given( 2025-05-07T20:32:25.4591584Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4591688Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4591805Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4591924Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4592037Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4592114Z ) 2025-05-07T20:32:25.4592364Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4592460Z def test_silu_mul_quant( 2025-05-07T20:32:25.4592546Z self, 2025-05-07T20:32:25.4592633Z T: int, 2025-05-07T20:32:25.4592711Z D: int, 2025-05-07T20:32:25.4592814Z scale_ub: Optional[float], 2025-05-07T20:32:25.4592905Z contiguous: bool, 2025-05-07T20:32:25.4592994Z compiled: bool, 2025-05-07T20:32:25.4593081Z ) -> None: 2025-05-07T20:32:25.4593178Z torch.manual_seed(2025) 2025-05-07T20:32:25.4593254Z 2025-05-07T20:32:25.4593475Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4595245Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
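For reference, the test body repeated above only fixes the operator's contract: split x into x0 and x1, apply SiLU to x0, multiply by x1, and quantize the product to FP8 with a per-row scale, optionally clamped by scale_ub. A hedged eager-mode sketch of that contract, assuming a rowwise E4M3 recipe (FBGEMM's actual kernel may compute its scales differently), is:

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_reference(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32 for a stable reference.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
        scale = row_max / fp8_max
        y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(1)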
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4595256Z 2025-05-07T20:32:25.4595379Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4595384Z 2025-05-07T20:32:25.4595492Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4595714Z self=, 2025-05-07T20:32:25.4595800Z T=2048, 2025-05-07T20:32:25.4595880Z D=5120, 2025-05-07T20:32:25.4595974Z scale_ub=1200.0, 2025-05-07T20:32:25.4596109Z contiguous=False, 2025-05-07T20:32:25.4596198Z compiled=False, 2025-05-07T20:32:25.4596281Z ) 2025-05-07T20:32:25.4596497Z self = 2025-05-07T20:32:25.4596675Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:25.4596679Z 2025-05-07T20:32:25.4596765Z @given( 2025-05-07T20:32:25.4596951Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4597054Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4597176Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4597297Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4597416Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4597491Z ) 2025-05-07T20:32:25.4597737Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4597837Z def test_silu_mul_quant( 2025-05-07T20:32:25.4597918Z self, 2025-05-07T20:32:25.4597997Z T: int, 2025-05-07T20:32:25.4598083Z D: int, 2025-05-07T20:32:25.4598183Z scale_ub: Optional[float], 2025-05-07T20:32:25.4598276Z contiguous: bool, 2025-05-07T20:32:25.4598370Z compiled: bool, 2025-05-07T20:32:25.4598450Z ) -> None: 2025-05-07T20:32:25.4598546Z torch.manual_seed(2025) 2025-05-07T20:32:25.4598625Z 2025-05-07T20:32:25.4598837Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4600612Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4600618Z 2025-05-07T20:32:25.4600738Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4600742Z 2025-05-07T20:32:25.4600852Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4601073Z self=, 2025-05-07T20:32:25.4601160Z T=4096, 2025-05-07T20:32:25.4601247Z D=7168, 2025-05-07T20:32:25.4601336Z scale_ub=1200.0, 2025-05-07T20:32:25.4601427Z contiguous=True, 2025-05-07T20:32:25.4601522Z compiled=False, 2025-05-07T20:32:25.4601599Z ) 2025-05-07T20:32:25.4601812Z self = 2025-05-07T20:32:25.4601989Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.4601994Z 2025-05-07T20:32:25.4602073Z @given( 2025-05-07T20:32:25.4602245Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4602347Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4602462Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4602583Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4602698Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4602776Z ) 2025-05-07T20:32:25.4603022Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4603125Z def test_silu_mul_quant( 2025-05-07T20:32:25.4603204Z self, 2025-05-07T20:32:25.4603289Z T: int, 2025-05-07T20:32:25.4603371Z D: int, 2025-05-07T20:32:25.4603481Z scale_ub: Optional[float], 2025-05-07T20:32:25.4603575Z contiguous: bool, 2025-05-07T20:32:25.4603664Z compiled: bool, 2025-05-07T20:32:25.4603749Z ) -> None: 2025-05-07T20:32:25.4603845Z torch.manual_seed(2025) 2025-05-07T20:32:25.4603919Z 2025-05-07T20:32:25.4604133Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4606260Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4606358Z 2025-05-07T20:32:25.4606488Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4606493Z 2025-05-07T20:32:25.4606597Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4606816Z self=, 2025-05-07T20:32:25.4606904Z T=16384, 2025-05-07T20:32:25.4606985Z D=7168, 2025-05-07T20:32:25.4607069Z scale_ub=None, 2025-05-07T20:32:25.4607157Z contiguous=False, 2025-05-07T20:32:25.4607240Z compiled=True, 2025-05-07T20:32:25.4607316Z ) 2025-05-07T20:32:25.4607585Z self = 2025-05-07T20:32:25.4607759Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:25.4607838Z 2025-05-07T20:32:25.4607927Z @given( 2025-05-07T20:32:25.4608043Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4608140Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4608258Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4608374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4608488Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4608563Z ) 2025-05-07T20:32:25.4608807Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4608905Z def test_silu_mul_quant( 2025-05-07T20:32:25.4608984Z self, 2025-05-07T20:32:25.4609065Z T: int, 2025-05-07T20:32:25.4609147Z D: int, 2025-05-07T20:32:25.4609249Z scale_ub: Optional[float], 2025-05-07T20:32:25.4609338Z contiguous: bool, 2025-05-07T20:32:25.4609434Z compiled: bool, 2025-05-07T20:32:25.4609514Z ) -> None: 2025-05-07T20:32:25.4609608Z torch.manual_seed(2025) 2025-05-07T20:32:25.4609689Z 2025-05-07T20:32:25.4609854Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4611693Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4611700Z 2025-05-07T20:32:25.4611819Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4611830Z 2025-05-07T20:32:25.4611930Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4612148Z self=, 2025-05-07T20:32:25.4612236Z T=4096, 2025-05-07T20:32:25.4612315Z D=7168, 2025-05-07T20:32:25.4612399Z scale_ub=None, 2025-05-07T20:32:25.4612488Z contiguous=True, 2025-05-07T20:32:25.4612573Z compiled=False, 2025-05-07T20:32:25.4612647Z ) 2025-05-07T20:32:25.4612866Z self = 2025-05-07T20:32:25.4613034Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.4613042Z 2025-05-07T20:32:25.4613179Z @given( 2025-05-07T20:32:25.4613299Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4613398Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4613514Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4613626Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4613738Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4613856Z ) 2025-05-07T20:32:25.4614101Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4614193Z def test_silu_mul_quant( 2025-05-07T20:32:25.4614278Z self, 2025-05-07T20:32:25.4614354Z T: int, 2025-05-07T20:32:25.4614431Z D: int, 2025-05-07T20:32:25.4614532Z scale_ub: Optional[float], 2025-05-07T20:32:25.4614621Z contiguous: bool, 2025-05-07T20:32:25.4614709Z compiled: bool, 2025-05-07T20:32:25.4614789Z ) -> None: 2025-05-07T20:32:25.4614883Z torch.manual_seed(2025) 2025-05-07T20:32:25.4614965Z 2025-05-07T20:32:25.4615130Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4616899Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
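The @given/@settings pair repeated in these tracebacks runs under the Hypothesis 'ci' profile reported in the session header further down this log (database=None, deadline=None, print_blob=True, derandomize=True, too_slow suppressed). A minimal sketch of how such a profile can be registered, with the values read off this log rather than from FBGEMM's sources; where it is registered (e.g. a conftest.py) is an assumption:

# Sketch: registering a Hypothesis profile matching the 'ci' profile this run loads.
from hypothesis import HealthCheck, settings

settings.register_profile(
    "ci",
    database=None,                                  # no example database on CI
    deadline=None,                                  # GPU kernel latency is too noisy for deadlines
    print_blob=True,                                # print reproduction blobs on failure
    derandomize=True,                               # deterministic example order across runs
    suppress_health_check=(HealthCheck.too_slow,),
)
settings.load_profile("ci")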
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4616954Z 2025-05-07T20:32:25.4617071Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4617076Z 2025-05-07T20:32:25.4617178Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4617405Z self=, 2025-05-07T20:32:25.4617482Z T=16384, 2025-05-07T20:32:25.4617561Z D=7168, 2025-05-07T20:32:25.4617648Z scale_ub=None, 2025-05-07T20:32:25.4617731Z contiguous=True, 2025-05-07T20:32:25.4617820Z compiled=False, 2025-05-07T20:32:25.4617893Z ) 2025-05-07T20:32:25.4618104Z self = 2025-05-07T20:32:25.4618278Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.4618288Z 2025-05-07T20:32:25.4618365Z @given( 2025-05-07T20:32:25.4618480Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4618581Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4618692Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4618809Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4618923Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4619038Z ) 2025-05-07T20:32:25.4619285Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4619378Z def test_silu_mul_quant( 2025-05-07T20:32:25.4619454Z self, 2025-05-07T20:32:25.4619534Z T: int, 2025-05-07T20:32:25.4619610Z D: int, 2025-05-07T20:32:25.4619707Z scale_ub: Optional[float], 2025-05-07T20:32:25.4619796Z contiguous: bool, 2025-05-07T20:32:25.4619879Z compiled: bool, 2025-05-07T20:32:25.4619964Z ) -> None: 2025-05-07T20:32:25.4620060Z torch.manual_seed(2025) 2025-05-07T20:32:25.4620134Z 2025-05-07T20:32:25.4620300Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4622116Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4622122Z 2025-05-07T20:32:25.4622237Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4622244Z 2025-05-07T20:32:25.4622344Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4622602Z self=, 2025-05-07T20:32:25.4622684Z T=16384, 2025-05-07T20:32:25.4622761Z D=7168, 2025-05-07T20:32:25.4622843Z scale_ub=1200.0, 2025-05-07T20:32:25.4622932Z contiguous=True, 2025-05-07T20:32:25.4623016Z compiled=False, 2025-05-07T20:32:25.4623093Z ) 2025-05-07T20:32:25.4623313Z self = 2025-05-07T20:32:25.4623490Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.4623495Z 2025-05-07T20:32:25.4623575Z @given( 2025-05-07T20:32:25.4623690Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4623786Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4623901Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4624015Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4624127Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4624249Z ) 2025-05-07T20:32:25.4624489Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4624581Z def test_silu_mul_quant( 2025-05-07T20:32:25.4624665Z self, 2025-05-07T20:32:25.4624743Z T: int, 2025-05-07T20:32:25.4624820Z D: int, 2025-05-07T20:32:25.4624921Z scale_ub: Optional[float], 2025-05-07T20:32:25.4625010Z contiguous: bool, 2025-05-07T20:32:25.4625099Z compiled: bool, 2025-05-07T20:32:25.4625181Z ) -> None: 2025-05-07T20:32:25.4625274Z torch.manual_seed(2025) 2025-05-07T20:32:25.4625350Z 2025-05-07T20:32:25.4625515Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4627280Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
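Each of these messages ends with the allocator's own remedy. For expandable_segments to take effect, PYTORCH_CUDA_ALLOC_CONF must be set before the process first initializes CUDA; a sketch, where the tensor shape is just the failing example's:

# Sketch: apply the allocator's suggestion from the OOM messages above.
# The variable must be set before torch initializes CUDA in this process.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after setting the env var

x = torch.randn([16384, 2 * 7168], device="cuda", dtype=torch.bfloat16)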
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4627294Z 2025-05-07T20:32:25.4627410Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4627415Z 2025-05-07T20:32:25.4627556Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4627782Z self=, 2025-05-07T20:32:25.4627858Z T=128, 2025-05-07T20:32:25.4627938Z D=5120, 2025-05-07T20:32:25.4628024Z scale_ub=1200.0, 2025-05-07T20:32:25.4628109Z contiguous=False, 2025-05-07T20:32:25.4628195Z compiled=False, 2025-05-07T20:32:25.4628268Z ) 2025-05-07T20:32:25.4628481Z self = 2025-05-07T20:32:25.4628657Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:25.4628661Z 2025-05-07T20:32:25.4628740Z @given( 2025-05-07T20:32:25.4628856Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4628956Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4629067Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4629180Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4629366Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4629443Z ) 2025-05-07T20:32:25.4629688Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4629780Z def test_silu_mul_quant( 2025-05-07T20:32:25.4629856Z self, 2025-05-07T20:32:25.4629937Z T: int, 2025-05-07T20:32:25.4630014Z D: int, 2025-05-07T20:32:25.4630109Z scale_ub: Optional[float], 2025-05-07T20:32:25.4630244Z contiguous: bool, 2025-05-07T20:32:25.4630331Z compiled: bool, 2025-05-07T20:32:25.4630408Z ) -> None: 2025-05-07T20:32:25.4630504Z torch.manual_seed(2025) 2025-05-07T20:32:25.4630576Z 2025-05-07T20:32:25.4630739Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4630819Z 2025-05-07T20:32:25.4630911Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4631040Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4631130Z x = x_sign * x_clamp 2025-05-07T20:32:25.4631212Z x0 = x[:, :D] 2025-05-07T20:32:25.4631296Z x1 = x[:, D:] 2025-05-07T20:32:25.4631369Z 2025-05-07T20:32:25.4631450Z if contiguous: 2025-05-07T20:32:25.4631544Z x0 = x0.contiguous() 2025-05-07T20:32:25.4631636Z x1 = x1.contiguous() 2025-05-07T20:32:25.4631709Z 2025-05-07T20:32:25.4631805Z if scale_ub is not None: 2025-05-07T20:32:25.4631911Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.4632091Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.4632173Z ) 2025-05-07T20:32:25.4632249Z else: 2025-05-07T20:32:25.4632342Z scale_ub_tensor = None 2025-05-07T20:32:25.4632438Z 2025-05-07T20:32:25.4632583Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.4632691Z op = silu_mul_quant 2025-05-07T20:32:25.4632777Z if compiled: 2025-05-07T20:32:25.4632880Z op = torch.compile(op) 2025-05-07T20:32:25.4632991Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4633064Z 2025-05-07T20:32:25.4633154Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.4633159Z 2025-05-07T20:32:25.4633259Z moe/activation_test.py:117: 2025-05-07T20:32:25.4633387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4633486Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.4633592Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4634092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.4634192Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:25.4634546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:32:25.4634768Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:25.4635160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:25.4635260Z kernel = self.compile(
2025-05-07T20:32:25.4635644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:25.4635820Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:25.4635948Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:25.4635957Z
2025-05-07T20:32:25.4636168Z self =
2025-05-07T20:32:25.4636942Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:25.4637488Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f005df3e700>}
2025-05-07T20:32:25.4638236Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:25.4638425Z context =
2025-05-07T20:32:25.4638470Z
2025-05-07T20:32:25.4638639Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:25.4638901Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:25.4639014Z module_map=module_map)
2025-05-07T20:32:25.4639177Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.4639277Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.4639363Z E ^
2025-05-07T20:32:25.4639723Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.4639728Z
2025-05-07T20:32:25.4641557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.4641566Z
2025-05-07T20:32:25.4641669Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.4641889Z self=,
2025-05-07T20:32:25.4642016Z T=2048,
2025-05-07T20:32:25.4642095Z D=7168,
2025-05-07T20:32:25.4642178Z scale_ub=None,
2025-05-07T20:32:25.4642271Z contiguous=False,
2025-05-07T20:32:25.4642357Z compiled=False,
2025-05-07T20:32:25.4642436Z )
2025-05-07T20:32:25.4642656Z self =
2025-05-07T20:32:25.4642828Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:25.4642833Z
2025-05-07T20:32:25.4642919Z @given(
2025-05-07T20:32:25.4643041Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:25.4643141Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:25.4643261Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:25.4643379Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:25.4643493Z compiled=st.sampled_from([True, False]),
2025-05-07T20:32:25.4643575Z )
2025-05-07T20:32:25.4643820Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:25.4643921Z def test_silu_mul_quant(
2025-05-07T20:32:25.4644004Z self,
2025-05-07T20:32:25.4644083Z T: int,
2025-05-07T20:32:25.4644162Z D: int,
2025-05-07T20:32:25.4644265Z scale_ub: Optional[float],
2025-05-07T20:32:25.4644357Z contiguous: bool,
2025-05-07T20:32:25.4644449Z compiled: bool,
2025-05-07T20:32:25.4644528Z ) -> None:
2025-05-07T20:32:25.4644624Z torch.manual_seed(2025)
2025-05-07T20:32:25.4644746Z
2025-05-07T20:32:25.4644918Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:25.4646684Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
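The CompilationError above comes from Triton lowering _fbgemm_silu_mul_quant: fp8e4nv is Triton's e4m3 float8 type, and per the error text this GPU's architecture only offers fp8e4b15 and fp8e5. One way to keep such tests green on runners like this one is to gate fp8 tests on compute capability; a sketch, where the (8, 9) cutoff (sm_89) is an assumption inferred from the error rather than taken from FBGEMM's own skip logic, and the class name is reused from this log for illustration:

# Sketch: skip fp8 tests on GPUs where Triton's fp8e4nv (e4m3) is unavailable.
# The sm_89 cutoff is an assumption based on the error message above.
import unittest
import torch

def supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8e4nv(), "Triton fp8e4nv unsupported on this GPU")
class ActivationTests(unittest.TestCase):
    ...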
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4646700Z 2025-05-07T20:32:25.4646819Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.4646823Z 2025-05-07T20:32:25.4646927Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4647153Z self=, 2025-05-07T20:32:25.4647271Z T=128, 2025-05-07T20:32:25.4647351Z D=7168, 2025-05-07T20:32:25.4647442Z scale_ub=1200.0, 2025-05-07T20:32:25.4647575Z contiguous=True, 2025-05-07T20:32:25.4647663Z compiled=True, 2025-05-07T20:32:25.4647738Z ) 2025-05-07T20:32:25.4647952Z self = 2025-05-07T20:32:25.4648124Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.4648174Z 2025-05-07T20:32:25.4648254Z @given( 2025-05-07T20:32:25.4648370Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4648472Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4648584Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4648697Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4648812Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4648889Z ) 2025-05-07T20:32:25.4649141Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4649234Z def test_silu_mul_quant( 2025-05-07T20:32:25.4649312Z self, 2025-05-07T20:32:25.4649392Z T: int, 2025-05-07T20:32:25.4649469Z D: int, 2025-05-07T20:32:25.4649566Z scale_ub: Optional[float], 2025-05-07T20:32:25.4649658Z contiguous: bool, 2025-05-07T20:32:25.4649744Z compiled: bool, 2025-05-07T20:32:25.4649821Z ) -> None: 2025-05-07T20:32:25.4649970Z torch.manual_seed(2025) 2025-05-07T20:32:25.4650044Z 2025-05-07T20:32:25.4650210Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4650288Z 2025-05-07T20:32:25.4650377Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4650506Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4650593Z x = x_sign * x_clamp 2025-05-07T20:32:25.4650673Z x0 = x[:, :D] 2025-05-07T20:32:25.4650754Z x1 = x[:, D:] 2025-05-07T20:32:25.4650830Z 2025-05-07T20:32:25.4650914Z if contiguous: 2025-05-07T20:32:25.4651008Z x0 = x0.contiguous() 2025-05-07T20:32:25.4651096Z x1 = x1.contiguous() 2025-05-07T20:32:25.4651170Z 2025-05-07T20:32:25.4651262Z if scale_ub is not None: 2025-05-07T20:32:25.4651368Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.4651501Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.4651589Z ) 2025-05-07T20:32:25.4651667Z else: 2025-05-07T20:32:25.4651761Z scale_ub_tensor = None 2025-05-07T20:32:25.4651834Z 2025-05-07T20:32:25.4651961Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.4652053Z op = silu_mul_quant 2025-05-07T20:32:25.4652137Z if compiled: 2025-05-07T20:32:25.4652238Z op = torch.compile(op) 2025-05-07T20:32:25.4652346Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4652420Z 2025-05-07T20:32:25.4652559Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.4652564Z 2025-05-07T20:32:25.4652663Z moe/activation_test.py:117: 2025-05-07T20:32:25.4652792Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4652895Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.4652991Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.4653356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.4653459Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.4653950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.4654046Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.4654405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.4654669Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.4655010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.4655103Z kernel = self.compile( 2025-05-07T20:32:25.4655483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.4655662Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.4655832Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.4655837Z 2025-05-07T20:32:25.4656039Z self = 2025-05-07T20:32:25.4656815Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.4657315Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f005df3ff60>} 2025-05-07T20:32:25.4658062Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.4658249Z context = 2025-05-07T20:32:25.4658297Z 2025-05-07T20:32:25.4658466Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.4658727Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.4658832Z module_map=module_map) 2025-05-07T20:32:25.4658995Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.4659092Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.4659173Z E ^ 2025-05-07T20:32:25.4659531Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.4659536Z 2025-05-07T20:32:25.4659946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.4659951Z 2025-05-07T20:32:25.4660059Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4660282Z self=, 2025-05-07T20:32:25.4660360Z T=128, 2025-05-07T20:32:25.4660442Z D=7168, 2025-05-07T20:32:25.4660525Z scale_ub=1200.0, 2025-05-07T20:32:25.4660609Z contiguous=True, 2025-05-07T20:32:25.4660696Z compiled=False, 2025-05-07T20:32:25.4660769Z ) 2025-05-07T20:32:25.4660984Z self = 2025-05-07T20:32:25.4661217Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.4661223Z 2025-05-07T20:32:25.4661302Z @given( 2025-05-07T20:32:25.4661423Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4661522Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4661634Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4661751Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4661862Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4661944Z ) 2025-05-07T20:32:25.4662190Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4662284Z def test_silu_mul_quant( 2025-05-07T20:32:25.4662364Z self, 2025-05-07T20:32:25.4662440Z T: int, 2025-05-07T20:32:25.4662517Z D: int, 2025-05-07T20:32:25.4662616Z scale_ub: Optional[float], 2025-05-07T20:32:25.4662704Z contiguous: bool, 2025-05-07T20:32:25.4662794Z compiled: bool, 2025-05-07T20:32:25.4662876Z ) -> None: 2025-05-07T20:32:25.4663011Z torch.manual_seed(2025) 2025-05-07T20:32:25.4663086Z 2025-05-07T20:32:25.4663251Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4663324Z 2025-05-07T20:32:25.4663416Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4663538Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4665305Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
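silu_mul_quant fuses SiLU(x0) * x1 with a row-wise fp8 quantization, which is what the test's reference path (ref_fn, shown later in this log) recomputes in eager mode. When the Triton kernel cannot compile, an eager stand-in is possible; a sketch, assuming the op targets torch.float8_e4m3fn with a finite max of 448.0, neither of which is confirmed by this log:

# Sketch: eager equivalent of SiLU(x0) * x1 followed by row-wise fp8 quantization.
# torch.float8_e4m3fn and the 448.0 max are assumptions about the kernel's target.
from typing import Optional, Tuple
import torch

FP8_MAX = 448.0  # assumed finite max of float8_e4m3fn

def silu_mul_quant_eager(
    x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)   # per-row absolute maximum
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)   # clamp rows to the upper bound
    y_scale = row_max / FP8_MAX                      # per-row dequantization scale
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale

Dequantizing as y_fp8.to(torch.float32) * y_scale[:, None] matches how the test compares fn() against ref_fn().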
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.4665352Z 2025-05-07T20:32:25.4665475Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:25.4665480Z 2025-05-07T20:32:25.4665584Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.4665802Z self=, 2025-05-07T20:32:25.4665881Z T=128, 2025-05-07T20:32:25.4665960Z D=5120, 2025-05-07T20:32:25.4666042Z scale_ub=1200.0, 2025-05-07T20:32:25.4666127Z contiguous=True, 2025-05-07T20:32:25.4666218Z compiled=True, 2025-05-07T20:32:25.4666336Z ) 2025-05-07T20:32:25.4666549Z self = 2025-05-07T20:32:25.4666718Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.4666722Z 2025-05-07T20:32:25.4666798Z @given( 2025-05-07T20:32:25.4666916Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.4667013Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.4667127Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.4667248Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.4667356Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.4667430Z ) 2025-05-07T20:32:25.4667674Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.4667766Z def test_silu_mul_quant( 2025-05-07T20:32:25.4667842Z self, 2025-05-07T20:32:25.4667921Z T: int, 2025-05-07T20:32:25.4668003Z D: int, 2025-05-07T20:32:25.4668102Z scale_ub: Optional[float], 2025-05-07T20:32:25.4668188Z contiguous: bool, 2025-05-07T20:32:25.4668272Z compiled: bool, 2025-05-07T20:32:25.4668351Z ) -> None: 2025-05-07T20:32:25.4668444Z torch.manual_seed(2025) 2025-05-07T20:32:25.4668517Z 2025-05-07T20:32:25.4668683Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.4668757Z 2025-05-07T20:32:25.4668847Z x_sign = torch.sign(x) 2025-05-07T20:32:25.4669018Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.4670772Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
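By this point even 20.00 MiB requests fail with only single-digit MiB free: tensors from earlier examples and state held by torch.compile are still occupying the 22 GiB device. Releasing that state between examples is one mitigation; a sketch of a teardown helper that is not part of the test file:

# Sketch: teardown helper to release GPU memory accumulated across examples.
import gc
import torch

def release_cuda_memory() -> None:
    torch._dynamo.reset()      # drop graphs cached by torch.compile
    gc.collect()               # collect dead Python references to tensors
    torch.cuda.empty_cache()   # hand cached allocator blocks back to the driver
    torch.cuda.synchronize()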
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:25.4670783Z
2025-05-07T20:32:25.4670901Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:25.4670906Z
2025-05-07T20:32:25.4671007Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.4671227Z self=,
2025-05-07T20:32:25.4671307Z T=128,
2025-05-07T20:32:25.4671424Z D=7168,
2025-05-07T20:32:25.4671511Z scale_ub=None,
2025-05-07T20:32:25.4671598Z contiguous=True,
2025-05-07T20:32:25.4671678Z compiled=True,
2025-05-07T20:32:25.4671752Z )
2025-05-07T20:32:25.4671963Z self =
2025-05-07T20:32:25.4672126Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:25.4672130Z
2025-05-07T20:32:25.4672254Z @given(
2025-05-07T20:32:25.4672374Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:25.4672470Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:25.4672589Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:25.4672703Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:25.4672817Z compiled=st.sampled_from([True, False]),
2025-05-07T20:32:25.4672891Z )
2025-05-07T20:32:25.4673134Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:25.4673233Z def test_silu_mul_quant(
2025-05-07T20:32:25.4673310Z self,
2025-05-07T20:32:25.4673385Z T: int,
2025-05-07T20:32:25.4673465Z D: int,
2025-05-07T20:32:25.4673561Z scale_ub: Optional[float],
2025-05-07T20:32:25.4673647Z contiguous: bool,
2025-05-07T20:32:25.4673734Z compiled: bool,
2025-05-07T20:32:25.4673813Z ) -> None:
2025-05-07T20:32:25.4673906Z torch.manual_seed(2025)
2025-05-07T20:32:25.4674029Z
2025-05-07T20:32:25.4674193Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:25.4675952Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:25.4675959Z
2025-05-07T20:32:25.4676076Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:25.4676218Z =============================== warnings summary ===============================
2025-05-07T20:32:25.4676522Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108
2025-05-07T20:32:25.4676825Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108
2025-05-07T20:32:25.4677124Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108
2025-05-07T20:32:25.4678034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
2025-05-07T20:32:25.4678269Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
2025-05-07T20:32:25.4678274Z
2025-05-07T20:32:25.4678481Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
2025-05-07T20:32:25.4678643Z ================= 1 failed, 1 deselected, 3 warnings in 13.78s =================
2025-05-07T20:32:27.0193059Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error)
2025-05-07T20:32:27.0808221Z [EXEC] [ATTEMPT 1/2] Command attempt failed.
2025-05-07T20:32:27.0808838Z
2025-05-07T20:32:29.0826081Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py
2025-05-07T20:32:31.2291666Z ============================= test session starts ==============================
2025-05-07T20:32:31.2292331Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:32:31.2292856Z cachedir: .pytest_cache
2025-05-07T20:32:31.2293423Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:32:31.2294255Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:32:31.2294706Z plugins: hypothesis-6.131.14
2025-05-07T20:32:32.8179360Z TMA benchmarks will be running with experimental grid constant TMA descriptor.
2025-05-07T20:32:32.9694614Z collecting ... collected 2 items / 1 deselected / 1 selected
2025-05-07T20:32:32.9695065Z run-last-failure: rerun previous 1 failure
2025-05-07T20:32:32.9695289Z
2025-05-07T20:32:35.3126066Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(
2025-05-07T20:32:35.3133299Z self=,
2025-05-07T20:32:35.3133763Z T=1,
2025-05-07T20:32:35.3133970Z D=5120,
2025-05-07T20:32:35.3134182Z scale_ub=None,
2025-05-07T20:32:35.3134404Z contiguous=True,
2025-05-07T20:32:35.3134640Z compiled=True,
2025-05-07T20:32:35.3134862Z )
2025-05-07T20:32:35.3135189Z self =
2025-05-07T20:32:35.3136034Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:35.3136299Z
2025-05-07T20:32:35.3136394Z @given(
2025-05-07T20:32:35.3136638Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:35.3136969Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:35.3137289Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:35.3137637Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:35.3137984Z compiled=st.sampled_from([True, False]),
2025-05-07T20:32:35.3138287Z )
2025-05-07T20:32:35.3138653Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:35.3139101Z def test_silu_mul_quant(
2025-05-07T20:32:35.3139359Z self,
2025-05-07T20:32:35.3139570Z T: int,
2025-05-07T20:32:35.3139774Z D: int,
2025-05-07T20:32:35.3140008Z scale_ub: Optional[float],
2025-05-07T20:32:35.3140296Z contiguous: bool,
2025-05-07T20:32:35.3140543Z compiled: bool,
2025-05-07T20:32:35.3140786Z ) -> None:
2025-05-07T20:32:35.3141016Z torch.manual_seed(2025)
2025-05-07T20:32:35.3141264Z
2025-05-07T20:32:35.3141551Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:35.3141908Z
2025-05-07T20:32:35.3142119Z x_sign = torch.sign(x)
2025-05-07T20:32:35.3142415Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:35.3142821Z x = x_sign * x_clamp 2025-05-07T20:32:35.3143086Z x0 = x[:, :D] 2025-05-07T20:32:35.3143318Z x1 = x[:, D:] 2025-05-07T20:32:35.3143532Z 2025-05-07T20:32:35.3143733Z if contiguous: 2025-05-07T20:32:35.3143983Z x0 = x0.contiguous() 2025-05-07T20:32:35.3144250Z x1 = x1.contiguous() 2025-05-07T20:32:35.3144506Z 2025-05-07T20:32:35.3144715Z if scale_ub is not None: 2025-05-07T20:32:35.3144996Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3145356Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3145681Z ) 2025-05-07T20:32:35.3145883Z else: 2025-05-07T20:32:35.3146111Z scale_ub_tensor = None 2025-05-07T20:32:35.3146377Z 2025-05-07T20:32:35.3146616Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3146949Z op = silu_mul_quant 2025-05-07T20:32:35.3147218Z if compiled: 2025-05-07T20:32:35.3147477Z op = torch.compile(op) 2025-05-07T20:32:35.3147877Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3148171Z 2025-05-07T20:32:35.3148379Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.3148671Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.3148975Z 2025-05-07T20:32:35.3149230Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3149571Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.3149965Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.3150293Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.3150659Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.3150986Z 2025-05-07T20:32:35.3151202Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.3151400Z 2025-05-07T20:32:35.3151508Z moe/activation_test.py:126: 2025-05-07T20:32:35.3151820Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3152173Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.3152511Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.3153311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.3154078Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.3154645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3155391Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3156083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.3156816Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.3157582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:35.3158333Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.3159071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.3159720Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.3160332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.3160861Z fn() 2025-05-07T20:32:35.3161379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.3161972Z self.fn.run( 
2025-05-07T20:32:35.3162452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3162989Z kernel = self.compile( 2025-05-07T20:32:35.3163594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3164258Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3164664Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3164911Z 2025-05-07T20:32:35.3165126Z self = 2025-05-07T20:32:35.3166267Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3167746Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5df65e13a0>} 2025-05-07T20:32:35.3169157Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3170192Z context = 2025-05-07T20:32:35.3170491Z 2025-05-07T20:32:35.3170661Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3171196Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3171723Z module_map=module_map) 2025-05-07T20:32:35.3172096Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3172469Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.3172750Z E ^ 2025-05-07T20:32:35.3173219Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3173679Z 2025-05-07T20:32:35.3174105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3174629Z 2025-05-07T20:32:35.3174739Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3175163Z self=, 2025-05-07T20:32:35.3175571Z T=2048, 2025-05-07T20:32:35.3175776Z D=5120, 2025-05-07T20:32:35.3175982Z scale_ub=1200.0, 2025-05-07T20:32:35.3176209Z contiguous=True, 2025-05-07T20:32:35.3176447Z compiled=False, 2025-05-07T20:32:35.3176713Z ) 2025-05-07T20:32:36.2357944Z self = 2025-05-07T20:32:36.2358658Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:36.2358940Z 2025-05-07T20:32:36.2359023Z @given( 2025-05-07T20:32:36.2359257Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.2359571Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.2359894Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.2360238Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.2360569Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.2360849Z ) 2025-05-07T20:32:36.2361203Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.2361646Z def test_silu_mul_quant( 2025-05-07T20:32:36.2361891Z self, 2025-05-07T20:32:36.2362079Z T: int, 2025-05-07T20:32:36.2362292Z D: int, 2025-05-07T20:32:36.2362514Z scale_ub: Optional[float], 2025-05-07T20:32:36.2362781Z contiguous: bool, 2025-05-07T20:32:36.2363023Z compiled: bool, 2025-05-07T20:32:36.2363260Z ) -> None: 2025-05-07T20:32:36.2363475Z torch.manual_seed(2025) 2025-05-07T20:32:36.2363716Z 2025-05-07T20:32:36.2363995Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.2364345Z 
2025-05-07T20:32:36.2364539Z x_sign = torch.sign(x) 2025-05-07T20:32:36.2365125Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.2365439Z x = x_sign * x_clamp 2025-05-07T20:32:36.2365672Z x0 = x[:, :D] 2025-05-07T20:32:36.2365896Z x1 = x[:, D:] 2025-05-07T20:32:36.2366106Z 2025-05-07T20:32:36.2366284Z if contiguous: 2025-05-07T20:32:36.2366515Z x0 = x0.contiguous() 2025-05-07T20:32:36.2366771Z x1 = x1.contiguous() 2025-05-07T20:32:36.2367005Z 2025-05-07T20:32:36.2367203Z if scale_ub is not None: 2025-05-07T20:32:36.2367472Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.2367906Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.2368215Z ) 2025-05-07T20:32:36.2368419Z else: 2025-05-07T20:32:36.2368633Z scale_ub_tensor = None 2025-05-07T20:32:36.2368881Z 2025-05-07T20:32:36.2369117Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.2369439Z op = silu_mul_quant 2025-05-07T20:32:36.2369785Z if compiled: 2025-05-07T20:32:36.2370042Z op = torch.compile(op) 2025-05-07T20:32:36.2370347Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.2370627Z 2025-05-07T20:32:36.2370824Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.2370989Z 2025-05-07T20:32:36.2371095Z moe/activation_test.py:117: 2025-05-07T20:32:36.2371390Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.2371807Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.2372095Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.2372796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.2373483Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.2374022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.2374712Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.2375381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.2375963Z kernel = self.compile( 2025-05-07T20:32:36.2376513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.2377174Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.2377656Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.2377892Z 2025-05-07T20:32:36.2378101Z self = 2025-05-07T20:32:36.2379186Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.2380583Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f5df62902c0>} 2025-05-07T20:32:36.2381934Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.2382966Z context = 2025-05-07T20:32:36.2383268Z 2025-05-07T20:32:36.2383439Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.2383966Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.2384438Z module_map=module_map) 2025-05-07T20:32:36.2384805Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.2385216Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.2385492Z E ^ 2025-05-07T20:32:36.2385956Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.2386412Z 2025-05-07T20:32:36.2386829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.2387347Z 2025-05-07T20:32:36.2387453Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.2387878Z self=, 2025-05-07T20:32:36.2388279Z T=2048, 2025-05-07T20:32:36.2388474Z D=5120, 2025-05-07T20:32:36.2388675Z scale_ub=1200.0, 2025-05-07T20:32:36.2388897Z contiguous=True, 2025-05-07T20:32:36.2389123Z compiled=True, 2025-05-07T20:32:36.2389338Z ) 2025-05-07T20:32:36.2389656Z self = 2025-05-07T20:32:36.2390202Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:36.2390475Z 2025-05-07T20:32:36.2390564Z @given( 2025-05-07T20:32:36.2390794Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.2391114Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.2391426Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.2391760Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.2392166Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.2392462Z ) 2025-05-07T20:32:36.2392821Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.2393260Z def test_silu_mul_quant( 2025-05-07T20:32:36.2393506Z self, 2025-05-07T20:32:36.2393709Z T: int, 2025-05-07T20:32:36.2393906Z D: int, 2025-05-07T20:32:36.2394131Z scale_ub: Optional[float], 2025-05-07T20:32:36.2394406Z contiguous: bool, 2025-05-07T20:32:36.2394645Z compiled: bool, 2025-05-07T20:32:36.2394876Z ) -> None: 2025-05-07T20:32:36.2395097Z torch.manual_seed(2025) 2025-05-07T20:32:36.2395345Z 2025-05-07T20:32:36.2395620Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.2395969Z 2025-05-07T20:32:36.2396170Z x_sign = torch.sign(x) 2025-05-07T20:32:36.2396463Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.2396777Z x = x_sign * x_clamp 2025-05-07T20:32:36.2397080Z x0 = x[:, :D] 2025-05-07T20:32:36.2397296Z x1 = x[:, D:] 2025-05-07T20:32:36.2397512Z 2025-05-07T20:32:36.2397706Z if contiguous: 2025-05-07T20:32:36.2397940Z x0 = x0.contiguous() 2025-05-07T20:32:36.2398203Z x1 = x1.contiguous() 2025-05-07T20:32:36.2398447Z 2025-05-07T20:32:36.2398643Z if scale_ub is not None: 2025-05-07T20:32:36.2398922Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.2399268Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.2399578Z ) 2025-05-07T20:32:36.2399782Z else: 2025-05-07T20:32:36.2399997Z scale_ub_tensor = None 2025-05-07T20:32:36.2400252Z 2025-05-07T20:32:36.2400490Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.2400808Z op = silu_mul_quant 2025-05-07T20:32:36.2401063Z if compiled: 
2025-05-07T20:32:36.2401310Z op = torch.compile(op) 2025-05-07T20:32:36.2401613Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.2401892Z 2025-05-07T20:32:36.2402085Z y_fp8, y_scale = fn() 2025-05-07T20:32:36.2402375Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:36.2402667Z 2025-05-07T20:32:36.2402905Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.2403244Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:36.2403540Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:36.2403904Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:36.2404270Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:36.2404589Z 2025-05-07T20:32:36.2404790Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:36.2404992Z 2025-05-07T20:32:36.2405093Z moe/activation_test.py:126: 2025-05-07T20:32:36.2405395Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.2406081Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:36.2406415Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:36.2407198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:36.2408019Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:36.2408559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.2409321Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.2410011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:36.2410730Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:36.2411477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:36.2413085Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:36.2413820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:36.2414462Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:36.2415059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:36.2415586Z fn() 2025-05-07T20:32:36.2416101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:36.2416687Z self.fn.run( 2025-05-07T20:32:36.2417153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.2417688Z kernel = self.compile( 2025-05-07T20:32:36.2418232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.2418959Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.2419362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.2419589Z 2025-05-07T20:32:36.2419803Z self = 2025-05-07T20:32:36.2420888Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:36.2422261Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5df6291440>} 2025-05-07T20:32:36.2423614Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.2424643Z context = 2025-05-07T20:32:36.2424932Z 2025-05-07T20:32:36.2425107Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.2425625Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.2426094Z module_map=module_map) 2025-05-07T20:32:36.2426535Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.2426900Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:36.2427168Z E ^ 2025-05-07T20:32:36.2427647Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.2428095Z 2025-05-07T20:32:36.2428515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.2429031Z 2025-05-07T20:32:36.2429142Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.2429550Z self=, 2025-05-07T20:32:36.2429959Z T=16384, 2025-05-07T20:32:36.2430161Z D=7168, 2025-05-07T20:32:36.2430355Z scale_ub=1200.0, 2025-05-07T20:32:36.2430584Z contiguous=False, 2025-05-07T20:32:36.2430814Z compiled=False, 2025-05-07T20:32:36.2431029Z ) 2025-05-07T20:32:37.0252104Z self = 2025-05-07T20:32:37.0252826Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:37.0253133Z 2025-05-07T20:32:37.0253214Z @given( 2025-05-07T20:32:37.0253457Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0253776Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0254092Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0254549Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0254889Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0255176Z ) 2025-05-07T20:32:37.0255534Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0255985Z def test_silu_mul_quant( 2025-05-07T20:32:37.0256230Z self, 2025-05-07T20:32:37.0256437Z T: int, 2025-05-07T20:32:37.0256646Z D: int, 2025-05-07T20:32:37.0256870Z scale_ub: Optional[float], 2025-05-07T20:32:37.0257157Z contiguous: bool, 2025-05-07T20:32:37.0257410Z compiled: bool, 2025-05-07T20:32:37.0257640Z ) -> None: 2025-05-07T20:32:37.0257866Z torch.manual_seed(2025) 2025-05-07T20:32:37.0258121Z 2025-05-07T20:32:37.0258397Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0258750Z 2025-05-07T20:32:37.0258957Z x_sign = torch.sign(x) 2025-05-07T20:32:37.0259354Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.0259667Z x = x_sign * x_clamp 2025-05-07T20:32:37.0259916Z x0 = x[:, :D] 2025-05-07T20:32:37.0260146Z x1 = x[:, D:] 2025-05-07T20:32:37.0260361Z 2025-05-07T20:32:37.0260558Z if contiguous: 2025-05-07T20:32:37.0260801Z x0 = x0.contiguous() 2025-05-07T20:32:37.0261066Z x1 = x1.contiguous() 2025-05-07T20:32:37.0261317Z 2025-05-07T20:32:37.0261522Z if scale_ub is not None: 2025-05-07T20:32:37.0261804Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.0262153Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.0262476Z ) 2025-05-07T20:32:37.0262676Z else: 2025-05-07T20:32:37.0262902Z scale_ub_tensor = None 2025-05-07T20:32:37.0263163Z 2025-05-07T20:32:37.0263398Z def fn() -> Tuple[torch.Tensor, 
torch.Tensor]:
2025-05-07T20:32:37.0263726Z             op = silu_mul_quant
2025-05-07T20:32:37.0263998Z             if compiled:
2025-05-07T20:32:37.0264252Z                 op = torch.compile(op)
2025-05-07T20:32:37.0264560Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:37.0264847Z 
2025-05-07T20:32:37.0265052Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:37.0265221Z 
2025-05-07T20:32:37.0265326Z moe/activation_test.py:117: 
2025-05-07T20:32:37.0265635Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:37.0266107Z moe/activation_test.py:115: in fn
2025-05-07T20:32:37.0266396Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:37.0267098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:37.0267801Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:37.0268345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:37.0269042Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:37.0269714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:37.0270256Z     kernel = self.compile(
2025-05-07T20:32:37.0270801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:37.0271461Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:37.0271917Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:37.0272153Z 
2025-05-07T20:32:37.0273445Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:37.0277558Z 
2025-05-07T20:32:37.0277733Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:37.0278255Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:37.0278730Z                            module_map=module_map)
2025-05-07T20:32:37.0279101Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:37.0279458Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:37.0279771Z E       ^
2025-05-07T20:32:37.0280240Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.0280691Z 
2025-05-07T20:32:37.0281111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:37.0281622Z 
2025-05-07T20:32:37.0281729Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.0282568Z     T=1,
2025-05-07T20:32:37.0282757Z     D=7168,
2025-05-07T20:32:37.0282967Z     scale_ub=None,
2025-05-07T20:32:37.0283193Z     contiguous=True,
2025-05-07T20:32:37.0283421Z     compiled=True,
2025-05-07T20:32:37.0283640Z )
2025-05-07T20:32:37.0284464Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:37.0291137Z 
2025-05-07T20:32:37.0291226Z     @given(
2025-05-07T20:32:37.0291471Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:37.0291795Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:37.0292105Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:37.0292442Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:37.0292786Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:37.0293070Z     )
2025-05-07T20:32:37.0293506Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:37.0293960Z     def test_silu_mul_quant(
2025-05-07T20:32:37.0294208Z         self,
2025-05-07T20:32:37.0294418Z         T: int,
2025-05-07T20:32:37.0294622Z         D: int,
2025-05-07T20:32:37.0294844Z         scale_ub: Optional[float],
2025-05-07T20:32:37.0295125Z         contiguous: bool,
2025-05-07T20:32:37.0295372Z         compiled: bool,
2025-05-07T20:32:37.0295611Z     ) -> None:
2025-05-07T20:32:37.0295832Z         torch.manual_seed(2025)
2025-05-07T20:32:37.0296114Z 
2025-05-07T20:32:37.0296409Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:37.0296756Z 
2025-05-07T20:32:37.0296960Z         x_sign = torch.sign(x)
2025-05-07T20:32:37.0297262Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:37.0297573Z         x = x_sign * x_clamp
2025-05-07T20:32:37.0297825Z         x0 = x[:, :D]
2025-05-07T20:32:37.0298055Z         x1 = x[:, D:]
2025-05-07T20:32:37.0298354Z 
2025-05-07T20:32:37.0298554Z         if contiguous:
2025-05-07T20:32:37.0298794Z             x0 = x0.contiguous()
2025-05-07T20:32:37.0299052Z             x1 = x1.contiguous()
2025-05-07T20:32:37.0299297Z 
2025-05-07T20:32:37.0299491Z         if scale_ub is not None:
2025-05-07T20:32:37.0299772Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:37.0300113Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:37.0300485Z             )
2025-05-07T20:32:37.0300690Z         else:
2025-05-07T20:32:37.0300903Z             scale_ub_tensor = None
2025-05-07T20:32:37.0301168Z 
2025-05-07T20:32:37.0301409Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:37.0301733Z             op = silu_mul_quant
2025-05-07T20:32:37.0301983Z             if compiled:
2025-05-07T20:32:37.0302239Z                 op = torch.compile(op)
2025-05-07T20:32:37.0302543Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:37.0302824Z 
2025-05-07T20:32:37.0303028Z         y_fp8, y_scale = fn()
2025-05-07T20:32:37.0303320Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:37.0303616Z 
2025-05-07T20:32:37.0303867Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:37.0304208Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:37.0304502Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:37.0304830Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:37.0305246Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:37.0305568Z 
2025-05-07T20:32:37.0306226Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:37.0306500Z 
2025-05-07T20:32:37.0306631Z moe/activation_test.py:126: 
2025-05-07T20:32:37.0307026Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:37.0307461Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:37.0307806Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:37.0308597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:37.0309351Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:37.0309904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:37.0310594Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:37.0311287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:37.0312004Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:37.0312757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:37.0313610Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:37.0314342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:37.0314976Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:37.0315584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:37.0316111Z     fn()
2025-05-07T20:32:37.0316620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:37.0317211Z     self.fn.run(
2025-05-07T20:32:37.0317684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:37.0318218Z     kernel = self.compile(
2025-05-07T20:32:37.0318755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:37.0319483Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:37.0319884Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:37.0320113Z 
2025-05-07T20:32:37.0321406Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:37.0325554Z 
2025-05-07T20:32:37.0325731Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:37.0326254Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:37.0326734Z                            module_map=module_map)
2025-05-07T20:32:37.0327195Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:37.0327651Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:37.0327926Z E       ^
2025-05-07T20:32:37.0328394Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.0328847Z 
2025-05-07T20:32:37.0329273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
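Note on the failure above: make_ir rejects both kernels because Triton's fp8e4nv element type (the CUDA-native torch.float8_e4m3fn layout) is only lowerable on GPUs with compute capability 8.9 or newer, while the A10G in a linux.g5.4xlarge runner reports 8.6; hence only 'fp8e4b15' and 'fp8e5' are offered. A minimal sketch of a capability gate that would skip these tests on such runners; the helper and class names are hypothetical, not FBGEMM API:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Assumption: fp8e4nv (torch.float8_e4m3fn) needs SM >= 8.9 (Ada/Hopper).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv not supported on this GPU")
class Fp8ActivationTests(unittest.TestCase):  # hypothetical gated suite
    def test_silu_mul_quant_smoke(self) -> None:
        ...  # fp8 test body would run only on SM 8.9+ devices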
2025-05-07T20:32:37.0329782Z 
2025-05-07T20:32:37.0329894Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.0330728Z     T=4096,
2025-05-07T20:32:37.0330922Z     D=5120,
2025-05-07T20:32:37.0331124Z     scale_ub=None,
2025-05-07T20:32:37.0331347Z     contiguous=False,
2025-05-07T20:32:37.0331574Z     compiled=False,
2025-05-07T20:32:37.0331789Z )
2025-05-07T20:32:37.9289096Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:37.9289371Z moe/activation_test.py:117: 
2025-05-07T20:32:37.9303026Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:37.9303380Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:37.9303651Z E       ^
2025-05-07T20:32:37.9304181Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.9305059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:37.9305574Z 
2025-05-07T20:32:37.9305953Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.9306935Z     T=4096,
2025-05-07T20:32:37.9307125Z     D=7168,
2025-05-07T20:32:37.9307334Z     scale_ub=None,
2025-05-07T20:32:37.9307556Z     contiguous=False,
2025-05-07T20:32:37.9307784Z     compiled=False,
2025-05-07T20:32:37.9308003Z )
2025-05-07T20:32:37.9320919Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:37.9321184Z moe/activation_test.py:117: 
2025-05-07T20:32:37.9334985Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:37.9335347Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:37.9335606Z E       ^
2025-05-07T20:32:37.9336099Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.9337007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
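For orientation, ref_fn in the listing above composes a SiLU-mul in fp32 with FBGEMM's triton_quantize_fp8_row. A rough pure-PyTorch equivalent of that rowwise quantization, assuming a recent PyTorch with torch.float8_e4m3fn; the eps and clamping details are illustrative and need not match the FBGEMM kernel exactly:

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def silu_mul_quant_ref(x0: torch.Tensor, x1: torch.Tensor, scale_ub=None):
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()  # SiLU(x0) * x1
    row_max = y.abs().amax(dim=1)                            # per-row amax
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub)         # cap the scale
    y_scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale


x = torch.randn(4, 16)
y_fp8, y_scale = silu_mul_quant_ref(x, x)  # runs on CPU, no Triton involved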
2025-05-07T20:32:37.9337518Z 
2025-05-07T20:32:37.9337635Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.9338453Z     T=128,
2025-05-07T20:32:37.9338647Z     D=7168,
2025-05-07T20:32:37.9338844Z     scale_ub=None,
2025-05-07T20:32:37.9339120Z     contiguous=False,
2025-05-07T20:32:37.9339360Z     compiled=True,
2025-05-07T20:32:37.9339566Z )
2025-05-07T20:32:37.9795753Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:37.9796067Z moe/activation_test.py:126: 
2025-05-07T20:32:37.9816888Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:37.9817252Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:37.9817516Z E       ^
2025-05-07T20:32:37.9817978Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.9818853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:37.9819372Z 
2025-05-07T20:32:37.9819479Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.9820297Z     T=128,
2025-05-07T20:32:37.9820556Z     D=7168,
2025-05-07T20:32:37.9820763Z     scale_ub=None,
2025-05-07T20:32:37.9820977Z     contiguous=False,
2025-05-07T20:32:37.9821211Z     compiled=False,
2025-05-07T20:32:37.9821419Z )
2025-05-07T20:32:38.2826529Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:38.2826813Z moe/activation_test.py:117: 
2025-05-07T20:32:38.2840559Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:38.2840913Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:38.2841184Z E       ^
2025-05-07T20:32:38.2841658Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:38.2842530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
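The "Trying example:" blocks are Hypothesis at verbosity=Verbosity.verbose walking the sampled_from grid declared on the test; every draw fails with the same compile error, so the whole grid is churned through. A self-contained sketch of that pattern, with the grid copied from the listing above and _MAX_SAMPLES replaced by a literal since its value does not appear in this log:

from hypothesis import Verbosity, given, settings, strategies as st


@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@settings(verbosity=Verbosity.verbose, max_examples=16, deadline=None)
def test_grid_shapes(T, D, scale_ub, contiguous, compiled) -> None:
    # Each verbose draw prints a "Trying example: ..." line like those above.
    assert T > 0 and D in (5120, 7168)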
2025-05-07T20:32:38.2843047Z 
2025-05-07T20:32:38.2843154Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:38.2843980Z     T=4096,
2025-05-07T20:32:38.2844172Z     D=5120,
2025-05-07T20:32:38.2844370Z     scale_ub=1200.0,
2025-05-07T20:32:38.2844597Z     contiguous=True,
2025-05-07T20:32:38.2844870Z     compiled=False,
2025-05-07T20:32:38.2845088Z )
2025-05-07T20:32:38.2857956Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:38.2858221Z moe/activation_test.py:117: 
2025-05-07T20:32:38.2871826Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:38.2872183Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:38.2872433Z E       ^
2025-05-07T20:32:38.2872893Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:38.2873755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:38.2874263Z 
2025-05-07T20:32:38.2874376Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:38.2875178Z     T=1,
2025-05-07T20:32:38.2875367Z     D=5120,
2025-05-07T20:32:38.2875553Z     scale_ub=None,
2025-05-07T20:32:38.2875767Z     contiguous=True,
2025-05-07T20:32:38.2875988Z     compiled=True,
2025-05-07T20:32:38.2876187Z )
2025-05-07T20:32:38.7160552Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:38.7160854Z moe/activation_test.py:126: 
2025-05-07T20:32:38.7181178Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:38.7181539Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:38.7181810Z E       ^
2025-05-07T20:32:38.7182270Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:38.7183137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
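Both _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row die at the same stage: src.make_ir rejects the fp8e4nv element type while lowering the jitted source, before any PTX is emitted. A minimal repro sketch under the same assumption; any Triton kernel that casts to tl.float8e4nv, launched on a pre-SM-8.9 GPU, should raise the identical ValueError, while on supported hardware it just stores the cast value:

import torch
import triton
import triton.language as tl


@triton.jit
def _cast_to_fp8(x_ptr, y_ptr):
    x = tl.load(x_ptr)
    tl.store(y_ptr, x.to(tl.float8e4nv))  # the cast make_ir rejects on SM < 8.9


x = torch.ones(1, device="cuda", dtype=torch.float32)
y = torch.empty(1, device="cuda", dtype=torch.float8_e4m3fn)
_cast_to_fp8[(1,)](x, y)  # CompilationError wrapping the ValueError on an A10G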
2025-05-07T20:32:38.7183656Z 
2025-05-07T20:32:38.7183761Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:38.7184571Z     T=2048,
2025-05-07T20:32:38.7184765Z     D=5120,
2025-05-07T20:32:38.7184964Z     scale_ub=None,
2025-05-07T20:32:38.7185181Z     contiguous=True,
2025-05-07T20:32:38.7185410Z     compiled=True,
2025-05-07T20:32:38.7185622Z )
2025-05-07T20:32:39.1346747Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:39.1347053Z moe/activation_test.py:126: 
2025-05-07T20:32:39.1367294Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:39.1367776Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:39.1368046Z E       ^
2025-05-07T20:32:39.1368506Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:39.1369383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:39.1369900Z 
2025-05-07T20:32:39.1370005Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:39.1370820Z     T=128,
2025-05-07T20:32:39.1371021Z     D=5120,
2025-05-07T20:32:39.1371229Z     scale_ub=None,
2025-05-07T20:32:39.1371449Z     contiguous=True,
2025-05-07T20:32:39.1371732Z     compiled=True,
2025-05-07T20:32:39.1371949Z )
2025-05-07T20:32:39.7909501Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:39.7909810Z moe/activation_test.py:126: 
2025-05-07T20:32:39.7930492Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:39.7930861Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:39.7931138Z E       ^
2025-05-07T20:32:39.7931621Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:39.7932508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
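Note where the ref_fn tracebacks pass through triton/runtime/autotuner.py: _kernel_quantize_fp8_row is an autotuned kernel, and Triton compiles lazily, so the unsupported-dtype error only surfaces when the autotuner benchmarks its pruned configs on the first launch. A sketch of that decorator pattern; the config values and key below are illustrative, not FBGEMM's:

import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 512}, num_warps=4, num_stages=2),
        triton.Config({"BLOCK": 1024}, num_warps=8, num_stages=2),
    ],
    key=["N"],  # retune (and recompile) per distinct N
)
@triton.jit
def _rowwise_kernel(x_ptr, N, BLOCK: tl.constexpr):
    # Body elided: each config is compiled and timed on first launch, which is
    # why the compile error appears under autotuner._bench / do_bench above.
    pass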
= None 2025-05-07T20:32:40.2840363Z 2025-05-07T20:32:40.2840598Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.2840911Z op = silu_mul_quant 2025-05-07T20:32:40.2841163Z if compiled: 2025-05-07T20:32:40.2841413Z op = torch.compile(op) 2025-05-07T20:32:40.2841702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.2841981Z 2025-05-07T20:32:40.2842173Z y_fp8, y_scale = fn() 2025-05-07T20:32:40.2842545Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:40.2842833Z 2025-05-07T20:32:40.2843068Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.2843403Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:40.2843691Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:40.2844008Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:40.2844367Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.2844680Z 2025-05-07T20:32:40.2844882Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:40.2845079Z 2025-05-07T20:32:40.2845183Z moe/activation_test.py:126: 2025-05-07T20:32:40.2845495Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.2845822Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:40.2846148Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.2847029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:40.2847893Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:40.2848440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.2849128Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.2849870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:40.2850593Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.2851349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:40.2852101Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.2852841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:40.2853479Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:40.2854085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:40.2854612Z fn() 2025-05-07T20:32:40.2855124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:40.2855765Z self.fn.run( 2025-05-07T20:32:40.2856240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.2856776Z kernel = self.compile( 2025-05-07T20:32:40.2857320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.2857982Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.2858388Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.2858619Z 2025-05-07T20:32:40.2858836Z self = 2025-05-07T20:32:40.2859914Z options = 
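For reference, the contract the test checks is visible in its last lines: dequantization is y_fp8.to(torch.float32) * y_scale[:, None], i.e. one fp32 scale per row. A rough eager sketch of row-wise fp8 quantization consistent with that contract follows; the helper name is hypothetical, and whether FBGEMM applies scale_ub to the row maximum or to the scale itself is an assumption here, not something the log confirms.

    from typing import Optional, Tuple
    import torch

    def rowwise_quantize_fp8_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor]
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, chosen so the row fits in fp8e4m3's range (max 448.0).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            # Assumed clamp point: cap the row maximum at scale_ub.
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=1e-12) / fp8_max  # avoid divide-by-zero
        y_fp8 = (y.to(torch.float32) / scale).to(torch.float8_e4m3fn)
        # The test dequantizes as y_fp8.to(float32) * scale[:, None].
        return y_fp8, scale.squeeze(1)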
moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f5d3fc79ee0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:40.2867569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
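Every failure in this section bottoms out in the same Triton check: fp8e4nv (the e4m3 format PyTorch calls float8_e4m3fn) is only compilable on NVIDIA GPUs with compute capability 8.9 or newer, while older SM 8.x parts such as an A10G (SM 8.6) expose only 'fp8e4b15' and 'fp8e5', exactly as the message lists. A minimal sketch of a capability gate that a suite like this could use to skip rather than fail on such GPUs is below; the helper and marker names are hypothetical, not FBGEMM's actual gating.

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton fp8e4nv kernels need SM 8.9+ (Ada/Hopper); an A10G reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical marker; the real suite may gate on a different helper.
    requires_fp8e4nv = pytest.mark.skipif(
        not supports_fp8e4nv(),
        reason="Triton fp8e4nv requires compute capability >= 8.9",
    )

With such a gate, the Hypothesis sweep would report skips on pre-SM-8.9 runners instead of failing every example.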
2025-05-07T20:32:40.2868188Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:32:40.3128037Z W0507 20:32:40.311000 88430 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:40.3129477Z W0507 20:32:40.311000 88430 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:40.3130809Z W0507 20:32:40.311000 88430 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:40.3131807Z W0507 20:32:40.311000 88430 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:40.3132908Z W0507 20:32:40.311000 88430 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.

This example fails identically to T=4096 above: ref_fn() at moe/activation_test.py:126 -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row, raising the same CompilationError (fp8e4nv not supported).
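The recompile_limit warning is explained by the test body itself: x0 = x[:, :D] is a view of a [T, 2*D] tensor, so its row stride is 2*D (10240 for D=5120), while the contiguous=True branch copies it down to row stride D (5120); sweeping contiguous and T across examples forces torch.compile to re-specialize until it hits the limit of 8 and falls back to eager. A sketch of the two knobs the warning points at, assuming current torch._dynamo config names (the knob name is printed by the warning itself):

    import torch

    # Tolerate more shape/stride variants before torch.compile falls back to eager.
    torch._dynamo.config.recompile_limit = 32

    # To print every recompilation reason, run the suite with, e.g.:
    #   TORCH_LOGS="recompiles" python -m pytest moe/activation_test.py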
_kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:40.3838470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.3839209Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.3839904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:40.3840634Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.3841388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:40.3842149Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.3842889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:40.3843540Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:40.3844143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:40.3844679Z fn() 2025-05-07T20:32:40.3845196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:40.3845779Z self.fn.run( 2025-05-07T20:32:40.3846259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.3846806Z kernel = self.compile( 2025-05-07T20:32:40.3847402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.3848178Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.3848585Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.3848818Z 2025-05-07T20:32:40.3849038Z self = 2025-05-07T20:32:40.3850129Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.3851524Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3f91d080>} 2025-05-07T20:32:40.3852926Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.3853961Z context = 2025-05-07T20:32:40.3854252Z 2025-05-07T20:32:40.3854426Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.3854946Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.3855421Z module_map=module_map) 2025-05-07T20:32:40.3855843Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.3856209Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:40.3856479Z E ^ 2025-05-07T20:32:40.3856950Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.3857410Z 2025-05-07T20:32:40.3857828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.3858346Z 2025-05-07T20:32:40.3858463Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.3858884Z self=, 2025-05-07T20:32:40.3859288Z T=1, 2025-05-07T20:32:40.3859486Z D=5120, 2025-05-07T20:32:40.3859692Z scale_ub=1200.0, 2025-05-07T20:32:40.3859921Z contiguous=True, 2025-05-07T20:32:40.3860155Z compiled=True, 2025-05-07T20:32:40.3860376Z ) 2025-05-07T20:32:40.6586240Z self = 2025-05-07T20:32:40.6586782Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:40.6587049Z 2025-05-07T20:32:40.6587132Z @given( 2025-05-07T20:32:40.6587375Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.6587693Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.6587996Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.6588351Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.6588691Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.6588979Z ) 2025-05-07T20:32:40.6589332Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.6589783Z def test_silu_mul_quant( 2025-05-07T20:32:40.6590033Z self, 2025-05-07T20:32:40.6590230Z T: int, 2025-05-07T20:32:40.6590438Z D: int, 2025-05-07T20:32:40.6590662Z scale_ub: Optional[float], 2025-05-07T20:32:40.6590948Z contiguous: bool, 2025-05-07T20:32:40.6591196Z compiled: bool, 2025-05-07T20:32:40.6591433Z ) -> None: 2025-05-07T20:32:40.6591649Z torch.manual_seed(2025) 2025-05-07T20:32:40.6591902Z 2025-05-07T20:32:40.6592183Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.6592531Z 2025-05-07T20:32:40.6592733Z x_sign = torch.sign(x) 2025-05-07T20:32:40.6593310Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.6593628Z x = x_sign * x_clamp 2025-05-07T20:32:40.6593879Z x0 = x[:, :D] 2025-05-07T20:32:40.6594103Z x1 = x[:, D:] 2025-05-07T20:32:40.6594309Z 2025-05-07T20:32:40.6594504Z if contiguous: 2025-05-07T20:32:40.6594743Z x0 = x0.contiguous() 2025-05-07T20:32:40.6594999Z x1 = x1.contiguous() 2025-05-07T20:32:40.6595249Z 2025-05-07T20:32:40.6595448Z if scale_ub is not None: 2025-05-07T20:32:40.6595722Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.6596066Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.6596380Z ) 2025-05-07T20:32:40.6596581Z else: 2025-05-07T20:32:40.6596790Z scale_ub_tensor = None 2025-05-07T20:32:40.6597046Z 2025-05-07T20:32:40.6597281Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6597596Z op = silu_mul_quant 2025-05-07T20:32:40.6597851Z if compiled: 2025-05-07T20:32:40.6598198Z op = torch.compile(op) 2025-05-07T20:32:40.6598497Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6598783Z 2025-05-07T20:32:40.6598986Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.6599151Z 2025-05-07T20:32:40.6599252Z moe/activation_test.py:117: 2025-05-07T20:32:40.6599551Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6599888Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.6600256Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6600815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.6601384Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.6602051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.6602744Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.6603291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.6603980Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.6604649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.6605183Z kernel = self.compile( 2025-05-07T20:32:40.6606021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.6606795Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.6607192Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6607428Z 2025-05-07T20:32:40.6607735Z self = 2025-05-07T20:32:40.6608833Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.6610241Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3f2ddee0>} 2025-05-07T20:32:40.6611600Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.6612641Z context = 2025-05-07T20:32:40.6612937Z 2025-05-07T20:32:40.6613103Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.6613634Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.6614180Z module_map=module_map) 2025-05-07T20:32:40.6614556Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.6614918Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.6615187Z E ^ 2025-05-07T20:32:40.6615653Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.6616109Z 2025-05-07T20:32:40.6616543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.6623576Z 2025-05-07T20:32:40.6623720Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.6624158Z self=, 2025-05-07T20:32:40.6624577Z T=1, 2025-05-07T20:32:40.6624779Z D=5120, 2025-05-07T20:32:40.6624980Z scale_ub=None, 2025-05-07T20:32:40.6625218Z contiguous=False, 2025-05-07T20:32:40.6625458Z compiled=True, 2025-05-07T20:32:40.6625675Z ) 2025-05-07T20:32:40.7098044Z self = 2025-05-07T20:32:40.7098567Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:40.7098833Z 2025-05-07T20:32:40.7098926Z @given( 2025-05-07T20:32:40.7099159Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.7099483Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.7099800Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.7100259Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.7100597Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.7100892Z ) 2025-05-07T20:32:40.7101242Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.7101693Z def test_silu_mul_quant( 2025-05-07T20:32:40.7101943Z self, 2025-05-07T20:32:40.7102149Z T: int, 2025-05-07T20:32:40.7102348Z D: int, 2025-05-07T20:32:40.7102578Z scale_ub: Optional[float], 2025-05-07T20:32:40.7102869Z contiguous: bool, 2025-05-07T20:32:40.7103109Z compiled: bool, 2025-05-07T20:32:40.7103345Z ) -> None: 2025-05-07T20:32:40.7103569Z torch.manual_seed(2025) 2025-05-07T20:32:40.7103808Z 2025-05-07T20:32:40.7104084Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.7104434Z 2025-05-07T20:32:40.7104632Z x_sign = torch.sign(x) 2025-05-07T20:32:40.7105012Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.7105330Z x = x_sign * x_clamp 2025-05-07T20:32:40.7105579Z x0 = x[:, :D] 2025-05-07T20:32:40.7106042Z x1 = x[:, D:] 2025-05-07T20:32:40.7106257Z 2025-05-07T20:32:40.7106454Z if contiguous: 2025-05-07T20:32:40.7106686Z x0 = x0.contiguous() 2025-05-07T20:32:40.7106953Z x1 = x1.contiguous() 2025-05-07T20:32:40.7107204Z 2025-05-07T20:32:40.7107397Z if scale_ub is not None: 2025-05-07T20:32:40.7107684Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.7108028Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.7108345Z ) 2025-05-07T20:32:40.7108542Z else: 2025-05-07T20:32:40.7108759Z scale_ub_tensor = None 2025-05-07T20:32:40.7109023Z 2025-05-07T20:32:40.7109253Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.7109576Z op = silu_mul_quant 2025-05-07T20:32:40.7109841Z if compiled: 2025-05-07T20:32:40.7110088Z op = torch.compile(op) 2025-05-07T20:32:40.7110391Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.7110671Z 2025-05-07T20:32:40.7110869Z y_fp8, y_scale = fn() 2025-05-07T20:32:40.7111178Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:40.7111478Z 2025-05-07T20:32:40.7111723Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.7112139Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:40.7112446Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:40.7112770Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:40.7113131Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.7113451Z 2025-05-07T20:32:40.7113659Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:40.7113854Z 2025-05-07T20:32:40.7113963Z moe/activation_test.py:126: 2025-05-07T20:32:40.7114269Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.7114609Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:40.7114942Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.7115732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:40.7116488Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:40.7117153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.7117847Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.7118534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:40.7119264Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.7120082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:40.7120823Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.7121554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:40.7122198Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:40.7122812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:40.7123334Z fn() 2025-05-07T20:32:40.7123850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:40.7124442Z self.fn.run( 2025-05-07T20:32:40.7124924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.7125526Z kernel = self.compile( 2025-05-07T20:32:40.7126076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.7126738Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.7127135Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.7127374Z 2025-05-07T20:32:40.7127669Z self = 2025-05-07T20:32:40.7128764Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.7130156Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f5d3f2f9f80>} 2025-05-07T20:32:40.7131516Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.7132541Z context = 2025-05-07T20:32:40.7132840Z 2025-05-07T20:32:40.7133012Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.7133594Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.7134076Z module_map=module_map) 2025-05-07T20:32:40.7134441Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.7134806Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:40.7135083Z E ^ 2025-05-07T20:32:40.7135547Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.7136013Z 2025-05-07T20:32:40.7136434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.7136955Z 2025-05-07T20:32:40.7137061Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.7137480Z self=, 2025-05-07T20:32:40.7137877Z T=1, 2025-05-07T20:32:40.7138075Z D=5120, 2025-05-07T20:32:40.7138280Z scale_ub=None, 2025-05-07T20:32:40.7138496Z contiguous=True, 2025-05-07T20:32:40.7138776Z compiled=False, 2025-05-07T20:32:40.7138990Z ) 2025-05-07T20:32:40.8299016Z self = 2025-05-07T20:32:40.8299779Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:40.8300133Z 2025-05-07T20:32:40.8300236Z @given( 2025-05-07T20:32:40.8300537Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.8300949Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.8301557Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.8301900Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.8302239Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.8302543Z ) 2025-05-07T20:32:40.8302896Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.8303348Z def test_silu_mul_quant( 2025-05-07T20:32:40.8303603Z self, 2025-05-07T20:32:40.8303811Z T: int, 2025-05-07T20:32:40.8304030Z D: int, 2025-05-07T20:32:40.8304271Z scale_ub: Optional[float], 2025-05-07T20:32:40.8304550Z contiguous: bool, 2025-05-07T20:32:40.8304805Z compiled: bool, 2025-05-07T20:32:40.8305054Z ) -> None: 2025-05-07T20:32:40.8305277Z torch.manual_seed(2025) 2025-05-07T20:32:40.8305534Z 2025-05-07T20:32:40.8306114Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.8306565Z 2025-05-07T20:32:40.8306760Z x_sign = torch.sign(x) 2025-05-07T20:32:40.8307058Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.8307375Z x = x_sign * x_clamp 2025-05-07T20:32:40.8307617Z x0 = x[:, :D] 2025-05-07T20:32:40.8307842Z x1 = x[:, D:] 2025-05-07T20:32:40.8308059Z 2025-05-07T20:32:40.8308252Z if contiguous: 2025-05-07T20:32:40.8308482Z x0 = x0.contiguous() 2025-05-07T20:32:40.8308742Z x1 = x1.contiguous() 2025-05-07T20:32:40.8308990Z 2025-05-07T20:32:40.8309184Z if scale_ub is not None: 2025-05-07T20:32:40.8309462Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.8309802Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.8310109Z ) 2025-05-07T20:32:40.8310304Z else: 2025-05-07T20:32:40.8310522Z scale_ub_tensor = None 2025-05-07T20:32:40.8310775Z 2025-05-07T20:32:40.8311015Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.8311335Z op = silu_mul_quant 2025-05-07T20:32:40.8311586Z if compiled: 2025-05-07T20:32:40.8311842Z 
op = torch.compile(op) 2025-05-07T20:32:40.8312148Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8312425Z 2025-05-07T20:32:40.8312623Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.8312794Z 2025-05-07T20:32:40.8312894Z moe/activation_test.py:117: 2025-05-07T20:32:40.8313278Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8313612Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.8313896Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8314592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.8315278Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.8315820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.8316514Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.8317228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.8317758Z kernel = self.compile( 2025-05-07T20:32:40.8318305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.8319041Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.8319444Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8319685Z 2025-05-07T20:32:40.8319894Z self = 2025-05-07T20:32:40.8320979Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.8322494Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3f2fb9c0>} 2025-05-07T20:32:40.8323844Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.8324866Z context = 2025-05-07T20:32:40.8325159Z 2025-05-07T20:32:40.8325325Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.8325852Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.8326322Z module_map=module_map) 2025-05-07T20:32:40.8326731Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.8327091Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.8327359Z E ^ 2025-05-07T20:32:40.8327915Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.8328368Z 2025-05-07T20:32:40.8328787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.8329308Z 2025-05-07T20:32:40.8329420Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.8329841Z self=, 2025-05-07T20:32:40.8330237Z T=128, 2025-05-07T20:32:40.8330434Z D=5120, 2025-05-07T20:32:40.8330632Z scale_ub=None, 2025-05-07T20:32:40.8330848Z contiguous=False, 2025-05-07T20:32:40.8331079Z compiled=True, 2025-05-07T20:32:40.8331289Z ) 2025-05-07T20:32:40.8331608Z self = 2025-05-07T20:32:40.8332107Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:40.8332383Z 2025-05-07T20:32:40.8332463Z @given( 2025-05-07T20:32:40.8332698Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.8333011Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.8333323Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.8333658Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.8334039Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.8334334Z ) 2025-05-07T20:32:40.8334685Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.8335127Z def test_silu_mul_quant( 2025-05-07T20:32:40.8335374Z self, 2025-05-07T20:32:40.8335573Z T: int, 2025-05-07T20:32:40.8335769Z D: int, 2025-05-07T20:32:40.8335991Z scale_ub: Optional[float], 2025-05-07T20:32:40.8336272Z contiguous: bool, 2025-05-07T20:32:40.8336508Z compiled: bool, 2025-05-07T20:32:40.8336737Z ) -> None: 2025-05-07T20:32:40.8336957Z torch.manual_seed(2025) 2025-05-07T20:32:40.8337207Z 2025-05-07T20:32:40.8337482Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.8337829Z 2025-05-07T20:32:40.8338033Z x_sign = torch.sign(x) 2025-05-07T20:32:40.8338328Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.8338648Z x = x_sign * x_clamp 2025-05-07T20:32:40.8338982Z x0 = x[:, :D] 2025-05-07T20:32:40.8339204Z x1 = x[:, D:] 2025-05-07T20:32:40.8339419Z 2025-05-07T20:32:40.8339613Z if contiguous: 2025-05-07T20:32:40.8339844Z x0 = x0.contiguous() 2025-05-07T20:32:40.8340113Z x1 = x1.contiguous() 2025-05-07T20:32:40.8340360Z 2025-05-07T20:32:40.8340553Z if scale_ub is not None: 2025-05-07T20:32:40.8340832Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.8341218Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.8341528Z ) 2025-05-07T20:32:40.8341735Z else: 2025-05-07T20:32:40.8341953Z scale_ub_tensor = None 2025-05-07T20:32:40.8342214Z 2025-05-07T20:32:40.8342448Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.8342770Z op = silu_mul_quant 2025-05-07T20:32:40.8343033Z if compiled: 2025-05-07T20:32:40.8343285Z op = torch.compile(op) 2025-05-07T20:32:40.8343591Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8343873Z 2025-05-07T20:32:40.8344065Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.8344236Z 2025-05-07T20:32:40.8344337Z moe/activation_test.py:117: 2025-05-07T20:32:40.8344643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8344978Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.8345317Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8345883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.8346446Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.8347157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.8347857Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.8348403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.8349083Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.8349749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.8350288Z kernel = self.compile( 2025-05-07T20:32:40.8350836Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.8351490Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.8351888Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8352117Z 2025-05-07T20:32:40.8352328Z self = 2025-05-07T20:32:40.8353456Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.8354826Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3f0a1120>} 2025-05-07T20:32:40.8356173Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.8357207Z context = 2025-05-07T20:32:40.8357499Z 2025-05-07T20:32:40.8357672Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.8358190Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.8358661Z module_map=module_map) 2025-05-07T20:32:40.8359079Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.8359444Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.8359704Z E ^ 2025-05-07T20:32:40.8360172Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.8360623Z 2025-05-07T20:32:40.8361046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.8361600Z 2025-05-07T20:32:40.8361713Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.8362124Z self=, 2025-05-07T20:32:40.8362535Z T=128, 2025-05-07T20:32:40.8362726Z D=7168, 2025-05-07T20:32:40.8362917Z scale_ub=1200.0, 2025-05-07T20:32:40.8363149Z contiguous=False, 2025-05-07T20:32:40.8363382Z compiled=False, 2025-05-07T20:32:40.8363590Z ) 2025-05-07T20:32:40.9234186Z self = 2025-05-07T20:32:40.9234953Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:40.9235340Z 2025-05-07T20:32:40.9235459Z @given( 2025-05-07T20:32:40.9235726Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.9236048Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.9236364Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.9236889Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.9237262Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.9237557Z ) 2025-05-07T20:32:40.9237918Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.9238363Z def test_silu_mul_quant( 2025-05-07T20:32:40.9238615Z self, 2025-05-07T20:32:40.9238823Z T: int, 2025-05-07T20:32:40.9239024Z D: int, 2025-05-07T20:32:40.9239252Z scale_ub: Optional[float], 2025-05-07T20:32:40.9239536Z contiguous: bool, 2025-05-07T20:32:40.9239784Z compiled: bool, 2025-05-07T20:32:40.9240014Z ) -> None: 2025-05-07T20:32:40.9240238Z torch.manual_seed(2025) 2025-05-07T20:32:40.9240485Z 2025-05-07T20:32:40.9240761Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.9241116Z 2025-05-07T20:32:40.9241318Z x_sign = torch.sign(x) 2025-05-07T20:32:40.9241612Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.9241940Z x = x_sign * x_clamp 2025-05-07T20:32:40.9242192Z x0 = x[:, :D] 2025-05-07T20:32:40.9242413Z x1 = x[:, D:] 2025-05-07T20:32:40.9242636Z 2025-05-07T20:32:40.9242832Z if contiguous: 2025-05-07T20:32:40.9243069Z x0 = x0.contiguous() 2025-05-07T20:32:40.9243340Z x1 = x1.contiguous() 2025-05-07T20:32:40.9243590Z 2025-05-07T20:32:40.9243788Z if scale_ub is not None: 2025-05-07T20:32:40.9244157Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.9244506Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.9244825Z ) 2025-05-07T20:32:40.9245023Z else: 2025-05-07T20:32:40.9245244Z scale_ub_tensor = None 2025-05-07T20:32:40.9245508Z 2025-05-07T20:32:40.9245745Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.9246072Z op = silu_mul_quant 2025-05-07T20:32:40.9246337Z if compiled: 2025-05-07T20:32:40.9246593Z op = torch.compile(op) 2025-05-07T20:32:40.9246899Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.9247213Z 2025-05-07T20:32:40.9247437Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.9247710Z 2025-05-07T20:32:40.9247814Z moe/activation_test.py:117: 2025-05-07T20:32:40.9248123Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.9248463Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.9248838Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.9249543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.9250244Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.9250783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.9251474Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.9252227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.9252767Z kernel = self.compile( 2025-05-07T20:32:40.9253309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.9253972Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.9254383Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.9254616Z 2025-05-07T20:32:40.9254825Z self = 2025-05-07T20:32:40.9255914Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.9257358Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3f0a0360>} 2025-05-07T20:32:40.9258709Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.9259740Z context = 2025-05-07T20:32:40.9260033Z 2025-05-07T20:32:40.9260205Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.9260737Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.9261210Z module_map=module_map) 2025-05-07T20:32:40.9261579Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.9261940Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.9262211Z E ^ 2025-05-07T20:32:40.9262682Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.9263134Z 2025-05-07T20:32:40.9263552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.9264072Z 2025-05-07T20:32:40.9264184Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.9264652Z self=, 2025-05-07T20:32:40.9265066Z T=128, 2025-05-07T20:32:40.9265259Z D=5120, 2025-05-07T20:32:40.9265470Z scale_ub=None, 2025-05-07T20:32:40.9265700Z contiguous=False, 2025-05-07T20:32:40.9265931Z compiled=False, 2025-05-07T20:32:40.9266155Z ) 2025-05-07T20:32:40.9266484Z self = 2025-05-07T20:32:40.9266993Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:40.9267318Z 2025-05-07T20:32:40.9267402Z @given( 2025-05-07T20:32:40.9267643Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.9267958Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.9268278Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.9268617Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.9268957Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.9269249Z ) 2025-05-07T20:32:40.9269657Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.9270108Z def test_silu_mul_quant( 2025-05-07T20:32:40.9270356Z self, 2025-05-07T20:32:40.9270561Z T: int, 2025-05-07T20:32:40.9270768Z D: int, 2025-05-07T20:32:40.9270989Z scale_ub: Optional[float], 2025-05-07T20:32:40.9271288Z contiguous: bool, 2025-05-07T20:32:40.9278306Z compiled: bool, 2025-05-07T20:32:40.9278631Z ) -> None: 2025-05-07T20:32:40.9278872Z torch.manual_seed(2025) 2025-05-07T20:32:40.9279133Z 2025-05-07T20:32:40.9279425Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.9279779Z 2025-05-07T20:32:40.9279989Z x_sign = torch.sign(x) 2025-05-07T20:32:40.9280300Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.9280613Z x = x_sign * x_clamp 2025-05-07T20:32:40.9280864Z x0 = x[:, :D] 2025-05-07T20:32:40.9281091Z x1 = x[:, D:] 2025-05-07T20:32:40.9281308Z 2025-05-07T20:32:40.9281511Z if contiguous: 2025-05-07T20:32:40.9281762Z x0 = x0.contiguous() 2025-05-07T20:32:40.9282027Z x1 = x1.contiguous() 2025-05-07T20:32:40.9282277Z 2025-05-07T20:32:40.9282488Z if scale_ub is not None: 2025-05-07T20:32:40.9282767Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.9283115Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.9283488Z ) 2025-05-07T20:32:40.9283690Z else: 2025-05-07T20:32:40.9283919Z scale_ub_tensor = None 2025-05-07T20:32:40.9284183Z 2025-05-07T20:32:40.9284430Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.9284755Z op = silu_mul_quant 2025-05-07T20:32:40.9285019Z if compiled: 2025-05-07T20:32:40.9285286Z op = torch.compile(op) 2025-05-07T20:32:40.9285587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.9285878Z 2025-05-07T20:32:40.9286085Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.9286250Z 2025-05-07T20:32:40.9286365Z moe/activation_test.py:117: 2025-05-07T20:32:40.9286666Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.9287017Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.9287314Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.9288086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.9288804Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.9289356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.9290056Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.9290727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.9291354Z kernel = self.compile( 2025-05-07T20:32:40.9291914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.9292585Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.9292990Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.9293231Z 2025-05-07T20:32:40.9293442Z self = 2025-05-07T20:32:40.9294544Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.9295941Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3ec28720>} 2025-05-07T20:32:40.9297348Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.9298398Z context = 2025-05-07T20:32:40.9298707Z 2025-05-07T20:32:40.9298878Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.9299458Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.9299937Z module_map=module_map) 2025-05-07T20:32:40.9300321Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.9300691Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.9300958Z E ^ 2025-05-07T20:32:40.9301437Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.9301907Z 2025-05-07T20:32:40.9302335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.9302852Z 2025-05-07T20:32:40.9302972Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.9303392Z self=, 2025-05-07T20:32:40.9303808Z T=128, 2025-05-07T20:32:40.9304017Z D=5120, 2025-05-07T20:32:40.9304268Z scale_ub=1200.0, 2025-05-07T20:32:40.9304509Z contiguous=True, 2025-05-07T20:32:40.9304747Z compiled=False, 2025-05-07T20:32:40.9304963Z ) 2025-05-07T20:32:41.2241820Z self = 2025-05-07T20:32:41.2242619Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:41.2242996Z 2025-05-07T20:32:41.2243080Z @given( 2025-05-07T20:32:41.2243319Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.2243667Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.2243977Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.2244313Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.2244645Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.2244937Z ) 2025-05-07T20:32:41.2245293Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.2245741Z def test_silu_mul_quant( 2025-05-07T20:32:41.2245998Z self, 2025-05-07T20:32:41.2246202Z T: int, 2025-05-07T20:32:41.2246407Z D: int, 2025-05-07T20:32:41.2246634Z scale_ub: Optional[float], 2025-05-07T20:32:41.2246908Z contiguous: bool, 2025-05-07T20:32:41.2247189Z compiled: bool, 2025-05-07T20:32:41.2247445Z ) -> None: 2025-05-07T20:32:41.2247783Z torch.manual_seed(2025) 2025-05-07T20:32:41.2248041Z 2025-05-07T20:32:41.2248615Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.2248970Z 2025-05-07T20:32:41.2249175Z x_sign = torch.sign(x) 2025-05-07T20:32:41.2249474Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.2249785Z x = x_sign * x_clamp 2025-05-07T20:32:41.2250037Z x0 = x[:, :D] 2025-05-07T20:32:41.2250264Z x1 = x[:, D:] 2025-05-07T20:32:41.2250478Z 2025-05-07T20:32:41.2250675Z if contiguous: 2025-05-07T20:32:41.2250925Z x0 = x0.contiguous() 2025-05-07T20:32:41.2251192Z x1 = x1.contiguous() 2025-05-07T20:32:41.2251443Z 2025-05-07T20:32:41.2251648Z if scale_ub is not None: 2025-05-07T20:32:41.2251921Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.2252268Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.2252587Z ) 2025-05-07T20:32:41.2252789Z else: 2025-05-07T20:32:41.2253002Z scale_ub_tensor = None 2025-05-07T20:32:41.2253266Z 2025-05-07T20:32:41.2253596Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.2253919Z op = silu_mul_quant 2025-05-07T20:32:41.2254178Z if compiled: 2025-05-07T20:32:41.2254434Z op = torch.compile(op) 2025-05-07T20:32:41.2254732Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2255016Z 2025-05-07T20:32:41.2255218Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.2255384Z 2025-05-07T20:32:41.2255557Z moe/activation_test.py:117: 2025-05-07T20:32:41.2255863Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2256200Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.2256486Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2257176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.2257879Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.2258427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.2259111Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.2259776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.2260315Z kernel = self.compile( 2025-05-07T20:32:41.2260861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.2261603Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.2262006Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2262235Z 2025-05-07T20:32:41.2262451Z self = 2025-05-07T20:32:41.2263542Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.2264932Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3ec298a0>} 2025-05-07T20:32:41.2266277Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.2267299Z context = 2025-05-07T20:32:41.2267594Z 2025-05-07T20:32:41.2267758Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.2268279Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.2268739Z module_map=module_map) 2025-05-07T20:32:41.2269152Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.2269508Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.2269772Z E ^ 2025-05-07T20:32:41.2270228Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.2270682Z 2025-05-07T20:32:41.2271096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.2271610Z 2025-05-07T20:32:41.2271718Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.2272124Z self=, 2025-05-07T20:32:41.2272520Z T=1, 2025-05-07T20:32:41.2272702Z D=7168, 2025-05-07T20:32:41.2272899Z scale_ub=1200.0, 2025-05-07T20:32:41.2273113Z contiguous=True, 2025-05-07T20:32:41.2273334Z compiled=True, 2025-05-07T20:32:41.2273549Z ) 2025-05-07T20:32:41.2273914Z self = 2025-05-07T20:32:41.2274407Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:41.2274672Z 2025-05-07T20:32:41.2274759Z @given( 2025-05-07T20:32:41.2274987Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.2275303Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.2275614Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.2275984Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.2276317Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.2276608Z ) 2025-05-07T20:32:41.2276960Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.2277401Z def test_silu_mul_quant( 2025-05-07T20:32:41.2277648Z self, 2025-05-07T20:32:41.2277847Z T: int, 2025-05-07T20:32:41.2278040Z D: int, 2025-05-07T20:32:41.2278262Z scale_ub: Optional[float], 2025-05-07T20:32:41.2278541Z contiguous: bool, 2025-05-07T20:32:41.2278779Z compiled: bool, 2025-05-07T20:32:41.2279004Z ) -> None: 2025-05-07T20:32:41.2279228Z torch.manual_seed(2025) 2025-05-07T20:32:41.2279469Z 2025-05-07T20:32:41.2279747Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.2280098Z 2025-05-07T20:32:41.2280295Z x_sign = torch.sign(x) 2025-05-07T20:32:41.2280596Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.2280963Z x = x_sign * x_clamp 2025-05-07T20:32:41.2281210Z x0 = x[:, :D] 2025-05-07T20:32:41.2281431Z x1 = x[:, D:] 2025-05-07T20:32:41.2281649Z 2025-05-07T20:32:41.2281843Z if contiguous: 2025-05-07T20:32:41.2282081Z x0 = x0.contiguous() 2025-05-07T20:32:41.2282345Z x1 = x1.contiguous() 2025-05-07T20:32:41.2282589Z 2025-05-07T20:32:41.2282781Z if scale_ub is not None: 2025-05-07T20:32:41.2283068Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.2283411Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.2283719Z ) 2025-05-07T20:32:41.2283920Z else: 2025-05-07T20:32:41.2284140Z scale_ub_tensor = None 2025-05-07T20:32:41.2284392Z 2025-05-07T20:32:41.2284627Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.2284946Z op = silu_mul_quant 2025-05-07T20:32:41.2285204Z if compiled: 2025-05-07T20:32:41.2285457Z op = torch.compile(op) 2025-05-07T20:32:41.2285756Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2286030Z 2025-05-07T20:32:41.2286228Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.2286399Z 2025-05-07T20:32:41.2286500Z moe/activation_test.py:117: 2025-05-07T20:32:41.2286800Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2287134Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.2287471Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2288132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.2288690Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.2289355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.2290053Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.2290600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.2291283Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.2291954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.2292494Z kernel = self.compile( 2025-05-07T20:32:41.2293084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.2293746Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.2294147Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2294377Z 2025-05-07T20:32:41.2294591Z self = 2025-05-07T20:32:41.2295669Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.2297104Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3ec2ae80>} 2025-05-07T20:32:41.2298456Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.2299480Z context = 2025-05-07T20:32:41.2299767Z 2025-05-07T20:32:41.2299939Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.2300458Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.2301004Z module_map=module_map) 2025-05-07T20:32:41.2301374Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.2301727Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.2301992Z E ^ 2025-05-07T20:32:41.2302462Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:41.2303953Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[test body and traceback elided: identical to the example above, failing at `y_fp8, y_scale = fn()` while compiling _fbgemm_silu_mul_quant with the same ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:41.3348146Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:41.3348559Z     self=<...>,
2025-05-07T20:32:41.3348962Z     T=1,
2025-05-07T20:32:41.3349156Z     D=7168,
2025-05-07T20:32:41.3349350Z     scale_ub=None,
2025-05-07T20:32:41.3349573Z     contiguous=False,
2025-05-07T20:32:41.3349804Z     compiled=True,
2025-05-07T20:32:41.3350014Z )
2025-05-07T20:32:41.4027178Z self = <...>
2025-05-07T20:32:41.4027917Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:41.4028284Z 
2025-05-07T20:32:41.4028397Z     @given(
2025-05-07T20:32:41.4028645Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:41.4028954Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:41.4029265Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:41.4029872Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:41.4030196Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:41.4030490Z     )
2025-05-07T20:32:41.4030845Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:41.4031284Z     def test_silu_mul_quant(
2025-05-07T20:32:41.4031535Z         self,
2025-05-07T20:32:41.4031738Z         T: int,
2025-05-07T20:32:41.4031935Z         D: int,
2025-05-07T20:32:41.4032163Z         scale_ub: Optional[float],
2025-05-07T20:32:41.4032439Z         contiguous: bool,
2025-05-07T20:32:41.4032681Z         compiled: bool,
2025-05-07T20:32:41.4032905Z     ) -> None:
2025-05-07T20:32:41.4033128Z         torch.manual_seed(2025)
2025-05-07T20:32:41.4033376Z 
2025-05-07T20:32:41.4033644Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:41.4033990Z 
2025-05-07T20:32:41.4034192Z         x_sign = torch.sign(x)
2025-05-07T20:32:41.4034481Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:41.4034793Z         x = x_sign * x_clamp
2025-05-07T20:32:41.4035038Z         x0 = x[:, :D]
2025-05-07T20:32:41.4035253Z         x1 = x[:, D:]
2025-05-07T20:32:41.4035467Z 
2025-05-07T20:32:41.4035654Z         if contiguous:
2025-05-07T20:32:41.4035880Z             x0 = x0.contiguous()
2025-05-07T20:32:41.4036141Z             x1 = x1.contiguous()
2025-05-07T20:32:41.4036384Z 
2025-05-07T20:32:41.4036665Z         if scale_ub is not None:
2025-05-07T20:32:41.4036947Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:41.4037285Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:41.4037596Z             )
2025-05-07T20:32:41.4037789Z         else:
2025-05-07T20:32:41.4038002Z             scale_ub_tensor = None
2025-05-07T20:32:41.4038259Z 
2025-05-07T20:32:41.4038488Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:41.4038811Z             op = silu_mul_quant
2025-05-07T20:32:41.4039062Z             if compiled:
2025-05-07T20:32:41.4039311Z                 op = torch.compile(op)
2025-05-07T20:32:41.4039610Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:41.4039891Z 
2025-05-07T20:32:41.4040088Z         y_fp8, y_scale = fn()
2025-05-07T20:32:41.4040380Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:41.4040677Z 
2025-05-07T20:32:41.4040914Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:41.4041335Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:41.4041637Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:41.4041956Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:41.4042314Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:41.4042631Z 
2025-05-07T20:32:41.4042842Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:41.4043039Z 
2025-05-07T20:32:41.4043216Z moe/activation_test.py:126: 
2025-05-07T20:32:41.4043518Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:41.4043859Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:41.4044184Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:41.4044974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:41.4045730Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:41.4046283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:41.4046961Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:41.4047765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:41.4048487Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:41.4056528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:41.4057398Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:41.4058153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:41.4058820Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:41.4059441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:41.4059968Z     fn()
2025-05-07T20:32:41.4060489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:41.4061086Z     self.fn.run(
2025-05-07T20:32:41.4061559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:41.4062111Z     kernel = self.compile(
2025-05-07T20:32:41.4062664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:41.4063330Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:41.4063733Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:41.4063977Z 
2025-05-07T20:32:41.4064269Z self = <...>
2025-05-07T20:32:41.4065363Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:41.4066757Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f5d3ee9d580>}
2025-05-07T20:32:41.4068160Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:41.4069193Z context = <...>
2025-05-07T20:32:41.4069491Z 
2025-05-07T20:32:41.4069660Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:41.4070240Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:41.4070713Z                            module_map=module_map)
2025-05-07T20:32:41.4071089Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:41.4071456Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:41.4071734Z E       ^
2025-05-07T20:32:41.4072205Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:41.4072744Z 
2025-05-07T20:32:41.4073162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:41.4073675Z 
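This example is different from the rest: fn() itself succeeded, and the failure moved into the eager reference path, because triton_quantize_fp8_row also JIT-compiles a Triton kernel (_kernel_quantize_fp8_row) that requests the same unsupported fp8e4nv type. A reference that avoids Triton entirely can be written in plain PyTorch; the sketch below assumes the returned scale is the per-row dequantization factor (so y is approximately y_fp8.float() * scale[:, None]) and that scale_ub caps the per-row max, which may not match fbgemm_gpu's exact semantics:

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def rowwise_quantize_fp8_reference(y: torch.Tensor, scale_ub=None):
    # Per-row absolute max in fp32, optionally capped by scale_ub.
    row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
    row_max = row_max.clamp(min=1e-12)  # guard all-zero rows
    scale = row_max / FP8_MAX  # one dequantization scale per row
    y_fp8 = (y.to(torch.float32) / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(-1)

The cast to torch.float8_e4m3fn is a plain elementwise conversion, so it runs on any CUDA device (and on CPU) regardless of whether the GPU has native fp8 instructions; only fused fp8 kernels like the Triton ones here need SM 8.9+.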
2025-05-07T20:32:41.4073792Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:41.5305235Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:41.5336814Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:41.7644987Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:41.8567215Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:41.8608774Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:41.8640426Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:42.0038404Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
[test body and traceback elided for each of the eight examples above: every one fails at `y_fp8, y_scale = fn()` with the same CompilationError while compiling _fbgemm_silu_mul_quant, raising ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
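Every "Trying example" block in this log is Hypothesis re-running the whole test body for one drawn parameter tuple (verbosity=Verbosity.verbose prints each draw, up to max_examples), which is why a single unsupported-dtype condition produces this wall of identical tracebacks. To replay one specific tuple deterministically, a case can be pinned with the @example decorator; a reduced, self-contained sketch (the strategy values are taken from the test above, the body here is a stand-in):

from hypothesis import Verbosity, example, given, settings, strategies as st


@settings(verbosity=Verbosity.verbose, max_examples=5, deadline=None)
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
)
@example(T=1, D=7168)  # pin the first failing tuple from this log
def test_shapes(T: int, D: int) -> None:
    assert T > 0 and D in (5120, 7168)


if __name__ == "__main__":
    test_shapes()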
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.0069559Z 2025-05-07T20:32:42.0069972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.0070486Z 2025-05-07T20:32:42.1211757Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1212758Z self=, 2025-05-07T20:32:42.1213301Z T=4096, 2025-05-07T20:32:42.1213496Z D=5120, 2025-05-07T20:32:42.1213688Z scale_ub=1200.0, 2025-05-07T20:32:42.1213920Z contiguous=False, 2025-05-07T20:32:42.1214152Z compiled=False, 2025-05-07T20:32:42.1214357Z ) 2025-05-07T20:32:42.1214682Z self = 2025-05-07T20:32:42.1215185Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.1215460Z 2025-05-07T20:32:42.1215549Z @given( 2025-05-07T20:32:42.1215783Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1216099Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1216405Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1216739Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1217072Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1217513Z ) 2025-05-07T20:32:42.1217859Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1218303Z def test_silu_mul_quant( 2025-05-07T20:32:42.1218551Z self, 2025-05-07T20:32:42.1218743Z T: int, 2025-05-07T20:32:42.1218950Z D: int, 2025-05-07T20:32:42.1219173Z scale_ub: Optional[float], 2025-05-07T20:32:42.1219440Z contiguous: bool, 2025-05-07T20:32:42.1219690Z compiled: bool, 2025-05-07T20:32:42.1219924Z ) -> None: 2025-05-07T20:32:42.1220142Z torch.manual_seed(2025) 2025-05-07T20:32:42.1220390Z 2025-05-07T20:32:42.1220668Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1221012Z 2025-05-07T20:32:42.1221212Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1221512Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1221822Z x = x_sign * x_clamp 2025-05-07T20:32:42.1222066Z x0 = x[:, :D] 2025-05-07T20:32:42.1222289Z x1 = x[:, D:] 2025-05-07T20:32:42.1222505Z 2025-05-07T20:32:42.1222694Z if contiguous: 2025-05-07T20:32:42.1222931Z x0 = x0.contiguous() 2025-05-07T20:32:42.1223201Z x1 = x1.contiguous() 2025-05-07T20:32:42.1223438Z 2025-05-07T20:32:42.1223640Z if scale_ub is not None: 2025-05-07T20:32:42.1223918Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1224257Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1224653Z ) 2025-05-07T20:32:42.1224860Z else: 2025-05-07T20:32:42.1225069Z scale_ub_tensor = None 2025-05-07T20:32:42.1225330Z 2025-05-07T20:32:42.1225566Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1225900Z op = silu_mul_quant 2025-05-07T20:32:42.1233256Z if compiled: 2025-05-07T20:32:42.1233538Z op = torch.compile(op) 2025-05-07T20:32:42.1233856Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1234148Z 2025-05-07T20:32:42.1234360Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1234531Z 2025-05-07T20:32:42.1234648Z moe/activation_test.py:117: 2025-05-07T20:32:42.1234951Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1235295Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1235593Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1236419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:42.1237132Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1237685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1238393Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1239063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1239661Z kernel = self.compile( 2025-05-07T20:32:42.1240218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1240896Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1241300Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1241538Z 2025-05-07T20:32:42.1241755Z self = 2025-05-07T20:32:42.1242844Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1244240Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e7df420>} 2025-05-07T20:32:42.1245636Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1246671Z context = 2025-05-07T20:32:42.1246974Z 2025-05-07T20:32:42.1247160Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1247847Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1248318Z module_map=module_map) 2025-05-07T20:32:42.1248702Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1249070Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1249334Z E ^ 2025-05-07T20:32:42.1249808Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1250269Z 2025-05-07T20:32:42.1250693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1251212Z 2025-05-07T20:32:42.1251320Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1251748Z self=, 2025-05-07T20:32:42.1252159Z T=4096, 2025-05-07T20:32:42.1252367Z D=5120, 2025-05-07T20:32:42.1252626Z scale_ub=1200.0, 2025-05-07T20:32:42.1252865Z contiguous=False, 2025-05-07T20:32:42.1253106Z compiled=True, 2025-05-07T20:32:42.1253328Z ) 2025-05-07T20:32:42.1253653Z self = 2025-05-07T20:32:42.1254166Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.1254452Z 2025-05-07T20:32:42.1254536Z @given( 2025-05-07T20:32:42.1254784Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1255111Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1255437Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1255780Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1256113Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1256415Z ) 2025-05-07T20:32:42.1256780Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1257233Z def test_silu_mul_quant( 2025-05-07T20:32:42.1257564Z self, 2025-05-07T20:32:42.1257777Z T: int, 2025-05-07T20:32:42.1257990Z D: int, 2025-05-07T20:32:42.1258215Z scale_ub: Optional[float], 2025-05-07T20:32:42.1258497Z contiguous: bool, 2025-05-07T20:32:42.1258750Z compiled: bool, 2025-05-07T20:32:42.1258981Z ) -> None: 2025-05-07T20:32:42.1259211Z torch.manual_seed(2025) 2025-05-07T20:32:42.1259467Z 2025-05-07T20:32:42.1259795Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1260157Z 2025-05-07T20:32:42.1260369Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1260667Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1260993Z x = x_sign * x_clamp 2025-05-07T20:32:42.1261251Z x0 = x[:, :D] 2025-05-07T20:32:42.1261475Z x1 = x[:, D:] 2025-05-07T20:32:42.1261696Z 2025-05-07T20:32:42.1261894Z if contiguous: 2025-05-07T20:32:42.1262135Z x0 = x0.contiguous() 2025-05-07T20:32:42.1262404Z x1 = x1.contiguous() 2025-05-07T20:32:42.1262660Z 2025-05-07T20:32:42.1262866Z if scale_ub is not None: 2025-05-07T20:32:42.1263152Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1263499Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1263819Z ) 2025-05-07T20:32:42.1264022Z else: 2025-05-07T20:32:42.1264244Z scale_ub_tensor = None 2025-05-07T20:32:42.1264568Z 2025-05-07T20:32:42.1264808Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1265135Z op = silu_mul_quant 2025-05-07T20:32:42.1265401Z if compiled: 2025-05-07T20:32:42.1265665Z op = torch.compile(op) 2025-05-07T20:32:42.1265977Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1266266Z 2025-05-07T20:32:42.1266467Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1266644Z 2025-05-07T20:32:42.1266753Z moe/activation_test.py:117: 2025-05-07T20:32:42.1267067Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1267418Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1267707Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1268276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.1268851Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.1269519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1270221Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1270772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1271470Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1272186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1272731Z kernel = self.compile( 2025-05-07T20:32:42.1273282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1273939Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1274349Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1274594Z 2025-05-07T20:32:42.1274806Z self = 2025-05-07T20:32:42.1275896Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1277313Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3f134860>} 2025-05-07T20:32:42.1278717Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1279752Z context = 2025-05-07T20:32:42.1280043Z 2025-05-07T20:32:42.1280223Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1280807Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1281278Z module_map=module_map) 2025-05-07T20:32:42.1281655Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1282022Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1282287Z E ^ 2025-05-07T20:32:42.1282766Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1283220Z 2025-05-07T20:32:42.1283646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1284158Z 2025-05-07T20:32:42.2152957Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2154225Z self=, 2025-05-07T20:32:42.2155312Z T=2048, 2025-05-07T20:32:42.2156065Z D=7168, 2025-05-07T20:32:42.2156455Z scale_ub=1200.0, 2025-05-07T20:32:42.2156901Z contiguous=False, 2025-05-07T20:32:42.2157305Z compiled=False, 2025-05-07T20:32:42.2157542Z ) 2025-05-07T20:32:42.2157893Z self = 2025-05-07T20:32:42.2158394Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.2158681Z 2025-05-07T20:32:42.2158767Z @given( 2025-05-07T20:32:42.2159013Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2159332Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2159649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2159981Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2160320Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2160616Z ) 2025-05-07T20:32:42.2160966Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2161417Z def test_silu_mul_quant( 2025-05-07T20:32:42.2161664Z self, 2025-05-07T20:32:42.2161858Z T: int, 2025-05-07T20:32:42.2162062Z D: int, 2025-05-07T20:32:42.2162284Z scale_ub: Optional[float], 2025-05-07T20:32:42.2162553Z contiguous: bool, 2025-05-07T20:32:42.2162798Z compiled: bool, 2025-05-07T20:32:42.2163026Z ) -> None: 2025-05-07T20:32:42.2163240Z torch.manual_seed(2025) 2025-05-07T20:32:42.2163487Z 2025-05-07T20:32:42.2163844Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2164189Z 2025-05-07T20:32:42.2164391Z x_sign = torch.sign(x) 2025-05-07T20:32:42.2164690Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.2165001Z x = x_sign * x_clamp 2025-05-07T20:32:42.2165240Z x0 = x[:, :D] 2025-05-07T20:32:42.2165464Z x1 = x[:, D:] 2025-05-07T20:32:42.2165680Z 2025-05-07T20:32:42.2165870Z if contiguous: 2025-05-07T20:32:42.2166108Z x0 = x0.contiguous() 2025-05-07T20:32:42.2166367Z x1 = x1.contiguous() 2025-05-07T20:32:42.2166605Z 2025-05-07T20:32:42.2166801Z if scale_ub is not None: 2025-05-07T20:32:42.2167075Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.2167409Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.2167825Z ) 2025-05-07T20:32:42.2168029Z else: 2025-05-07T20:32:42.2168245Z scale_ub_tensor = None 2025-05-07T20:32:42.2168576Z 2025-05-07T20:32:42.2168814Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.2169124Z op = silu_mul_quant 2025-05-07T20:32:42.2169379Z if compiled: 2025-05-07T20:32:42.2169634Z op = torch.compile(op) 2025-05-07T20:32:42.2169928Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.2170212Z 2025-05-07T20:32:42.2170411Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.2170650Z 2025-05-07T20:32:42.2170757Z moe/activation_test.py:117: 2025-05-07T20:32:42.2171056Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.2171394Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.2171679Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.2172369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:42.2173073Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.2173616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.2174311Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.2174976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.2175529Z kernel = self.compile( 2025-05-07T20:32:42.2176129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.2176782Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.2177189Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.2177425Z 2025-05-07T20:32:42.2177633Z self = 2025-05-07T20:32:42.2178718Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.2180105Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3f1356c0>} 2025-05-07T20:32:42.2181434Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.2182458Z context = 2025-05-07T20:32:42.2182748Z 2025-05-07T20:32:42.2182912Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.2183433Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.2183935Z module_map=module_map) 2025-05-07T20:32:42.2184298Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.2184652Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.2184903Z E ^ 2025-05-07T20:32:42.2185367Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.2185818Z 2025-05-07T20:32:42.2186229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.2186743Z 2025-05-07T20:32:42.2186850Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2187258Z self=, 2025-05-07T20:32:42.2187662Z T=1, 2025-05-07T20:32:42.2187847Z D=7168, 2025-05-07T20:32:42.2188035Z scale_ub=None, 2025-05-07T20:32:42.2188249Z contiguous=True, 2025-05-07T20:32:42.2188470Z compiled=False, 2025-05-07T20:32:42.2188673Z ) 2025-05-07T20:32:42.2189040Z self = 2025-05-07T20:32:42.2189522Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.2189777Z 2025-05-07T20:32:42.2189860Z @given( 2025-05-07T20:32:42.2190083Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2190394Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2190739Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2191065Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2191394Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2191681Z ) 2025-05-07T20:32:42.2192019Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2192459Z def test_silu_mul_quant( 2025-05-07T20:32:42.2192699Z self, 2025-05-07T20:32:42.2192888Z T: int, 2025-05-07T20:32:42.2193089Z D: int, 2025-05-07T20:32:42.2193310Z scale_ub: Optional[float], 2025-05-07T20:32:42.2193583Z contiguous: bool, 2025-05-07T20:32:42.2193816Z compiled: bool, 2025-05-07T20:32:42.2194046Z ) -> None: 2025-05-07T20:32:42.2194268Z torch.manual_seed(2025) 2025-05-07T20:32:42.2194508Z 2025-05-07T20:32:42.2194787Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2195135Z 2025-05-07T20:32:42.2195336Z x_sign = torch.sign(x) 2025-05-07T20:32:42.2195682Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.2195996Z x = x_sign * x_clamp 2025-05-07T20:32:42.2196236Z x0 = x[:, :D] 2025-05-07T20:32:42.2196460Z x1 = x[:, D:] 2025-05-07T20:32:42.2196676Z 2025-05-07T20:32:42.2196862Z if contiguous: 2025-05-07T20:32:42.2197098Z x0 = x0.contiguous() 2025-05-07T20:32:42.2197361Z x1 = x1.contiguous() 2025-05-07T20:32:42.2197626Z 2025-05-07T20:32:42.2197855Z if scale_ub is not None: 2025-05-07T20:32:42.2198132Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.2198464Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.2198779Z ) 2025-05-07T20:32:42.2198981Z else: 2025-05-07T20:32:42.2199196Z scale_ub_tensor = None 2025-05-07T20:32:42.2199449Z 2025-05-07T20:32:42.2199685Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.2200007Z op = silu_mul_quant 2025-05-07T20:32:42.2200257Z if compiled: 2025-05-07T20:32:42.2200510Z op = torch.compile(op) 2025-05-07T20:32:42.2200807Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.2201083Z 2025-05-07T20:32:42.2201282Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.2201447Z 2025-05-07T20:32:42.2201552Z moe/activation_test.py:117: 2025-05-07T20:32:42.2201848Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.2202258Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.2202545Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.2203235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.2203931Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.2204469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.2205162Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.2206184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.2206880Z kernel = self.compile( 2025-05-07T20:32:42.2207426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.2208136Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.2208629Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.2208869Z 2025-05-07T20:32:42.2209080Z self = 2025-05-07T20:32:42.2210160Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.2211593Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3f134fe0>} 2025-05-07T20:32:42.2212931Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.2213962Z context = 2025-05-07T20:32:42.2214259Z 2025-05-07T20:32:42.2214424Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.2214951Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.2215413Z module_map=module_map) 2025-05-07T20:32:42.2215785Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.2216264Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.2216529Z E ^ 2025-05-07T20:32:42.2216994Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.2217449Z 2025-05-07T20:32:42.2217866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.2218375Z 2025-05-07T20:32:42.2218488Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2218915Z self=, 2025-05-07T20:32:42.2219317Z T=16384, 2025-05-07T20:32:42.2219520Z D=7168, 2025-05-07T20:32:42.2219719Z scale_ub=1200.0, 2025-05-07T20:32:42.2219939Z contiguous=False, 2025-05-07T20:32:42.2220169Z compiled=True, 2025-05-07T20:32:42.5681073Z ) 2025-05-07T20:32:42.5681635Z self = 2025-05-07T20:32:42.5682364Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.5682772Z 2025-05-07T20:32:42.5682864Z @given( 2025-05-07T20:32:42.5683103Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5683419Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5683719Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5684054Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5684663Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5684950Z ) 2025-05-07T20:32:42.5685304Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5685739Z def test_silu_mul_quant( 2025-05-07T20:32:42.5685974Z self, 2025-05-07T20:32:42.5686169Z T: int, 2025-05-07T20:32:42.5686368Z D: int, 2025-05-07T20:32:42.5686580Z scale_ub: Optional[float], 2025-05-07T20:32:42.5686860Z contiguous: bool, 2025-05-07T20:32:42.5687104Z compiled: bool, 2025-05-07T20:32:42.5687326Z ) -> None: 2025-05-07T20:32:42.5687632Z torch.manual_seed(2025) 2025-05-07T20:32:42.5687884Z 2025-05-07T20:32:42.5688162Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5688502Z 2025-05-07T20:32:42.5688702Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5689002Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5689308Z x = x_sign * x_clamp 2025-05-07T20:32:42.5689559Z x0 = x[:, :D] 2025-05-07T20:32:42.5689869Z x1 = x[:, D:] 2025-05-07T20:32:42.5690077Z 2025-05-07T20:32:42.5690274Z if contiguous: 2025-05-07T20:32:42.5690511Z x0 = x0.contiguous() 2025-05-07T20:32:42.5690775Z x1 = x1.contiguous() 2025-05-07T20:32:42.5691018Z 2025-05-07T20:32:42.5691215Z if scale_ub is not None: 2025-05-07T20:32:42.5691486Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5691907Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5692220Z ) 2025-05-07T20:32:42.5692422Z else: 2025-05-07T20:32:42.5692639Z scale_ub_tensor = None 2025-05-07T20:32:42.5692902Z 2025-05-07T20:32:42.5693143Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5693456Z op = silu_mul_quant 2025-05-07T20:32:42.5693714Z if compiled: 2025-05-07T20:32:42.5693968Z op = torch.compile(op) 2025-05-07T20:32:42.5694269Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5694551Z 2025-05-07T20:32:42.5694751Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5694937Z 2025-05-07T20:32:42.5695044Z moe/activation_test.py:117: 2025-05-07T20:32:42.5695340Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5695683Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5695973Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5696629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.5697199Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.5697869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5698569Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5699110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5699800Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5700465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5701003Z kernel = self.compile( 2025-05-07T20:32:42.5701543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5702212Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5702615Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5702846Z 2025-05-07T20:32:42.5703054Z self = 2025-05-07T20:32:42.5704187Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5705857Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3f137b00>} 2025-05-07T20:32:42.5707215Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5708252Z context = 2025-05-07T20:32:42.5708544Z 2025-05-07T20:32:42.5708710Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5709237Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5709707Z module_map=module_map) 2025-05-07T20:32:42.5710079Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5710504Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5710772Z E ^ 2025-05-07T20:32:42.5711242Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5711693Z 2025-05-07T20:32:42.5712112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5712692Z 2025-05-07T20:32:42.5712802Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5713220Z self=, 2025-05-07T20:32:42.5713628Z T=1, 2025-05-07T20:32:42.5713812Z D=7168, 2025-05-07T20:32:42.5714013Z scale_ub=None, 2025-05-07T20:32:42.5714237Z contiguous=False, 2025-05-07T20:32:42.5714464Z compiled=False, 2025-05-07T20:32:42.5721475Z ) 2025-05-07T20:32:42.5721852Z self = 2025-05-07T20:32:42.5722368Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.5722649Z 2025-05-07T20:32:42.5722733Z @given( 2025-05-07T20:32:42.5722982Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5723303Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5723624Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5723969Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5724422Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5724722Z ) 2025-05-07T20:32:42.5725085Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5725539Z def test_silu_mul_quant( 2025-05-07T20:32:42.5725786Z self, 2025-05-07T20:32:42.5725993Z T: int, 2025-05-07T20:32:42.5726206Z D: int, 2025-05-07T20:32:42.5726429Z scale_ub: Optional[float], 2025-05-07T20:32:42.5726714Z contiguous: bool, 2025-05-07T20:32:42.5726973Z compiled: bool, 2025-05-07T20:32:42.5727204Z ) -> None: 2025-05-07T20:32:42.5727447Z torch.manual_seed(2025) 2025-05-07T20:32:42.5727828Z 2025-05-07T20:32:42.5728109Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5728469Z 2025-05-07T20:32:42.5728683Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5728982Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5729311Z x = x_sign * x_clamp 2025-05-07T20:32:42.5729567Z x0 = x[:, :D] 2025-05-07T20:32:42.5729798Z x1 = x[:, D:] 2025-05-07T20:32:42.5730013Z 2025-05-07T20:32:42.5730213Z if contiguous: 2025-05-07T20:32:42.5730459Z x0 = x0.contiguous() 2025-05-07T20:32:42.5730725Z x1 = x1.contiguous() 2025-05-07T20:32:42.5730980Z 2025-05-07T20:32:42.5731192Z if scale_ub is not None: 2025-05-07T20:32:42.5731481Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5731905Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5732229Z ) 2025-05-07T20:32:42.5732430Z else: 2025-05-07T20:32:42.5732656Z scale_ub_tensor = None 2025-05-07T20:32:42.5732922Z 2025-05-07T20:32:42.5733162Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5733492Z op = silu_mul_quant 2025-05-07T20:32:42.5733759Z if compiled: 2025-05-07T20:32:42.5734018Z op = torch.compile(op) 2025-05-07T20:32:42.5734335Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5734625Z 2025-05-07T20:32:42.5734828Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5735007Z 2025-05-07T20:32:42.5735113Z moe/activation_test.py:117: 2025-05-07T20:32:42.5735425Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5735775Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5736064Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5736828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5737538Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5738081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5738779Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5739500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5740047Z kernel = self.compile( 2025-05-07T20:32:42.5740593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5741264Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5741673Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5741909Z 2025-05-07T20:32:42.5742134Z self = 2025-05-07T20:32:42.5743220Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5744608Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e86c9a0>} 2025-05-07T20:32:42.5746015Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5747058Z context = 2025-05-07T20:32:42.5747348Z 2025-05-07T20:32:42.5747532Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5748061Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5748542Z module_map=module_map) 2025-05-07T20:32:42.5748919Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5749277Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5749550Z E ^ 2025-05-07T20:32:42.5750035Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5750488Z 2025-05-07T20:32:42.5750917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5751432Z 2025-05-07T20:32:42.5751541Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5751967Z self=, 2025-05-07T20:32:42.5752432Z T=2048, 2025-05-07T20:32:42.5752637Z D=7168, 2025-05-07T20:32:42.5752852Z scale_ub=None, 2025-05-07T20:32:42.5753089Z contiguous=False, 2025-05-07T20:32:42.5753326Z compiled=True, 2025-05-07T20:32:42.5753547Z ) 2025-05-07T20:32:42.6437428Z self = 2025-05-07T20:32:42.6438237Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.6438623Z 2025-05-07T20:32:42.6438767Z @given( 2025-05-07T20:32:42.6439083Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6439500Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6439822Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6440161Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6440503Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6440803Z ) 2025-05-07T20:32:42.6441163Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6441875Z def test_silu_mul_quant( 2025-05-07T20:32:42.6442140Z self, 2025-05-07T20:32:42.6442343Z T: int, 2025-05-07T20:32:42.6442563Z D: int, 2025-05-07T20:32:42.6442794Z scale_ub: Optional[float], 2025-05-07T20:32:42.6443078Z contiguous: bool, 2025-05-07T20:32:42.6443326Z compiled: bool, 2025-05-07T20:32:42.6443564Z ) -> None: 2025-05-07T20:32:42.6443794Z torch.manual_seed(2025) 2025-05-07T20:32:42.6444180Z 2025-05-07T20:32:42.6444468Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6444824Z 2025-05-07T20:32:42.6445024Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6445328Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6445653Z x = x_sign * x_clamp 2025-05-07T20:32:42.6445898Z x0 = x[:, :D] 2025-05-07T20:32:42.6446135Z x1 = x[:, D:] 2025-05-07T20:32:42.6446358Z 2025-05-07T20:32:42.6446552Z if contiguous: 2025-05-07T20:32:42.6446796Z x0 = x0.contiguous() 2025-05-07T20:32:42.6447066Z x1 = x1.contiguous() 2025-05-07T20:32:42.6447311Z 2025-05-07T20:32:42.6447519Z if scale_ub is not None: 2025-05-07T20:32:42.6447957Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6448297Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6448615Z ) 2025-05-07T20:32:42.6448917Z else: 2025-05-07T20:32:42.6449144Z scale_ub_tensor = None 2025-05-07T20:32:42.6449400Z 2025-05-07T20:32:42.6449646Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6449972Z op = silu_mul_quant 2025-05-07T20:32:42.6450231Z if compiled: 2025-05-07T20:32:42.6450495Z op = torch.compile(op) 2025-05-07T20:32:42.6450807Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6451086Z 2025-05-07T20:32:42.6451297Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6451470Z 2025-05-07T20:32:42.6451575Z moe/activation_test.py:117: 2025-05-07T20:32:42.6451881Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6452217Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6452509Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6453082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6453662Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6454322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6455018Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6455566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6456322Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6456995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6457533Z kernel = self.compile( 2025-05-07T20:32:42.6458078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6458731Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6459136Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6459371Z 2025-05-07T20:32:42.6459586Z self = 2025-05-07T20:32:42.6460667Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6462102Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e86dd00>} 2025-05-07T20:32:42.6463455Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6464483Z context = 2025-05-07T20:32:42.6464818Z 2025-05-07T20:32:42.6464993Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6465515Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6465991Z module_map=module_map) 2025-05-07T20:32:42.6466362Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6466723Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6466987Z E ^ 2025-05-07T20:32:42.6467484Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6467941Z 2025-05-07T20:32:42.6468361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6468873Z 2025-05-07T20:32:42.6468987Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6469406Z self=, 2025-05-07T20:32:42.6469862Z T=4096, 2025-05-07T20:32:42.6470061Z D=7168, 2025-05-07T20:32:42.6470264Z scale_ub=None, 2025-05-07T20:32:42.6470486Z contiguous=False, 2025-05-07T20:32:42.6470722Z compiled=True, 2025-05-07T20:32:42.6470938Z ) 2025-05-07T20:32:42.6471261Z self = 2025-05-07T20:32:42.6471765Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.6472038Z 2025-05-07T20:32:42.6472131Z @given( 2025-05-07T20:32:42.6472370Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6472694Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6473010Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6473350Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6473685Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6473982Z ) 2025-05-07T20:32:42.6474349Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6474793Z def test_silu_mul_quant( 2025-05-07T20:32:42.6475045Z self, 2025-05-07T20:32:42.6475250Z T: int, 2025-05-07T20:32:42.6475450Z D: int, 2025-05-07T20:32:42.6475675Z scale_ub: Optional[float], 2025-05-07T20:32:42.6475951Z contiguous: bool, 2025-05-07T20:32:42.6476189Z compiled: bool, 2025-05-07T20:32:42.6476418Z ) -> None: 2025-05-07T20:32:42.6476686Z torch.manual_seed(2025) 2025-05-07T20:32:42.6476933Z 2025-05-07T20:32:42.6477208Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6477556Z 2025-05-07T20:32:42.6477748Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6478042Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6478355Z x = x_sign * x_clamp 2025-05-07T20:32:42.6478600Z x0 = x[:, :D] 2025-05-07T20:32:42.6478813Z x1 = x[:, D:] 2025-05-07T20:32:42.6479033Z 2025-05-07T20:32:42.6479220Z if contiguous: 2025-05-07T20:32:42.6479448Z x0 = x0.contiguous() 2025-05-07T20:32:42.6479710Z x1 = x1.contiguous() 2025-05-07T20:32:42.6479956Z 2025-05-07T20:32:42.6480147Z if scale_ub is not None: 2025-05-07T20:32:42.6480424Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6480757Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6481062Z ) 2025-05-07T20:32:42.6481264Z else: 2025-05-07T20:32:42.6481524Z scale_ub_tensor = None 2025-05-07T20:32:42.6481776Z 2025-05-07T20:32:42.6482010Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6482328Z op = silu_mul_quant 2025-05-07T20:32:42.6482577Z if compiled: 2025-05-07T20:32:42.6482829Z op = torch.compile(op) 2025-05-07T20:32:42.6483129Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6483451Z 2025-05-07T20:32:42.6483646Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6483819Z 2025-05-07T20:32:42.6483921Z moe/activation_test.py:117: 2025-05-07T20:32:42.6484220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6484552Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6484839Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6485399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6485962Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6486625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6487320Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6487929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6488608Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6489321Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6489864Z kernel = self.compile( 2025-05-07T20:32:42.6490399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6491059Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6491463Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6491691Z 2025-05-07T20:32:42.6491907Z self = 2025-05-07T20:32:42.6492979Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6494359Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e86e840>} 2025-05-07T20:32:42.6495695Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6496716Z context = 2025-05-07T20:32:42.6497047Z 2025-05-07T20:32:42.6497223Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6497741Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6498209Z module_map=module_map) 2025-05-07T20:32:42.6498578Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6498928Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6499195Z E ^ 2025-05-07T20:32:42.6499660Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6500106Z 2025-05-07T20:32:42.6500525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6501033Z 2025-05-07T20:32:42.7764021Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7765245Z self=, 2025-05-07T20:32:42.7766749Z T=16384, 2025-05-07T20:32:42.7767158Z D=5120, 2025-05-07T20:32:42.7767497Z scale_ub=1200.0, 2025-05-07T20:32:42.7767867Z contiguous=False, 2025-05-07T20:32:42.7768093Z compiled=False, 2025-05-07T20:32:42.7768302Z ) 2025-05-07T20:32:42.7768620Z self = 2025-05-07T20:32:42.7769126Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.7769494Z 2025-05-07T20:32:42.7769584Z @given( 2025-05-07T20:32:42.7769812Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7770128Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7770440Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7770765Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7771096Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7771388Z ) 2025-05-07T20:32:42.7771747Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7772191Z def test_silu_mul_quant( 2025-05-07T20:32:42.7772437Z self, 2025-05-07T20:32:42.7772637Z T: int, 2025-05-07T20:32:42.7772834Z D: int, 2025-05-07T20:32:42.7773056Z scale_ub: Optional[float], 2025-05-07T20:32:42.7773335Z contiguous: bool, 2025-05-07T20:32:42.7773574Z compiled: bool, 2025-05-07T20:32:42.7773894Z ) -> None: 2025-05-07T20:32:42.7774116Z torch.manual_seed(2025) 2025-05-07T20:32:42.7774362Z 2025-05-07T20:32:42.7774642Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7774992Z 2025-05-07T20:32:42.7775188Z x_sign = torch.sign(x) 2025-05-07T20:32:42.7775488Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.7775805Z x = x_sign * x_clamp 2025-05-07T20:32:42.7776053Z x0 = x[:, :D] 2025-05-07T20:32:42.7776271Z x1 = x[:, D:] 2025-05-07T20:32:42.7776485Z 2025-05-07T20:32:42.7776679Z if contiguous: 2025-05-07T20:32:42.7776906Z x0 = x0.contiguous() 2025-05-07T20:32:42.7777166Z x1 = x1.contiguous() 2025-05-07T20:32:42.7777412Z 2025-05-07T20:32:42.7777606Z if scale_ub is not None: 2025-05-07T20:32:42.7777928Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.7778270Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.7778582Z ) 2025-05-07T20:32:42.7778783Z else: 2025-05-07T20:32:42.7779001Z scale_ub_tensor = None 2025-05-07T20:32:42.7779257Z 2025-05-07T20:32:42.7779483Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.7779802Z op = silu_mul_quant 2025-05-07T20:32:42.7780057Z if compiled: 2025-05-07T20:32:42.7780307Z op = torch.compile(op) 2025-05-07T20:32:42.7780606Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7780962Z 2025-05-07T20:32:42.7781163Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.7781336Z 2025-05-07T20:32:42.7781440Z moe/activation_test.py:117: 2025-05-07T20:32:42.7781742Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7782077Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.7782356Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7783045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:42.7783744Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.7784281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.7784968Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.7785636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.7786255Z kernel = self.compile( 2025-05-07T20:32:42.7786796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.7787453Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.7787854Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7788082Z 2025-05-07T20:32:42.7788344Z self = 2025-05-07T20:32:42.7789422Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.7790812Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e4f4040>} 2025-05-07T20:32:42.7792161Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.7793186Z context = 2025-05-07T20:32:42.7793478Z 2025-05-07T20:32:42.7793645Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.7794221Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.7794692Z module_map=module_map) 2025-05-07T20:32:42.7795062Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.7795411Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.7795680Z E ^ 2025-05-07T20:32:42.7796148Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.7796601Z 2025-05-07T20:32:42.7797022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.7797538Z 2025-05-07T20:32:42.7797644Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7798061Z self=, 2025-05-07T20:32:42.7798468Z T=16384, 2025-05-07T20:32:42.7798662Z D=5120, 2025-05-07T20:32:42.7798871Z scale_ub=1200.0, 2025-05-07T20:32:42.7799100Z contiguous=True, 2025-05-07T20:32:42.7799321Z compiled=True, 2025-05-07T20:32:42.7799532Z ) 2025-05-07T20:32:42.7799855Z self = 2025-05-07T20:32:42.7800343Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.7800628Z 2025-05-07T20:32:42.7800709Z @given( 2025-05-07T20:32:42.7800944Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7801314Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7801617Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7801969Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7802298Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7802589Z ) 2025-05-07T20:32:42.7802935Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7803386Z def test_silu_mul_quant( 2025-05-07T20:32:42.7803635Z self, 2025-05-07T20:32:42.7803828Z T: int, 2025-05-07T20:32:42.7804034Z D: int, 2025-05-07T20:32:42.7804257Z scale_ub: Optional[float], 2025-05-07T20:32:42.7804521Z contiguous: bool, 2025-05-07T20:32:42.7804763Z compiled: bool, 2025-05-07T20:32:42.7804987Z ) -> None: 2025-05-07T20:32:42.7805199Z torch.manual_seed(2025) 2025-05-07T20:32:42.7805451Z 2025-05-07T20:32:42.7806014Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7806441Z 2025-05-07T20:32:42.7806636Z x_sign = torch.sign(x) 2025-05-07T20:32:42.7806931Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.7807243Z x = x_sign * x_clamp 2025-05-07T20:32:42.7807483Z x0 = x[:, :D] 2025-05-07T20:32:42.7807786Z x1 = x[:, D:] 2025-05-07T20:32:42.7807999Z 2025-05-07T20:32:42.7808190Z if contiguous: 2025-05-07T20:32:42.7808495Z x0 = x0.contiguous() 2025-05-07T20:32:42.7808764Z x1 = x1.contiguous() 2025-05-07T20:32:42.7809001Z 2025-05-07T20:32:42.7809198Z if scale_ub is not None: 2025-05-07T20:32:42.7809477Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.7809811Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.7810123Z ) 2025-05-07T20:32:42.7810323Z else: 2025-05-07T20:32:42.7810534Z scale_ub_tensor = None 2025-05-07T20:32:42.7810793Z 2025-05-07T20:32:42.7811036Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.7811356Z op = silu_mul_quant 2025-05-07T20:32:42.7811605Z if compiled: 2025-05-07T20:32:42.7811858Z op = torch.compile(op) 2025-05-07T20:32:42.7812158Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7812432Z 2025-05-07T20:32:42.7812628Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.7812811Z 2025-05-07T20:32:42.7813002Z moe/activation_test.py:117: 2025-05-07T20:32:42.7813306Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7813636Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.7813921Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7821920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.7822508Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.7823184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.7823889Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.7824441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.7825129Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.7825808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.7826369Z kernel = self.compile( 2025-05-07T20:32:42.7826926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.7827588Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.7828003Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7828238Z 2025-05-07T20:32:42.7828566Z self = 2025-05-07T20:32:42.7829659Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.7831042Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e4f5300>} 2025-05-07T20:32:42.7832404Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.7833442Z context = 2025-05-07T20:32:42.7833737Z 2025-05-07T20:32:42.7833915Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.7834498Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.7834979Z module_map=module_map) 2025-05-07T20:32:42.7835360Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.7835726Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.7835992Z E ^ 2025-05-07T20:32:42.7836468Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.7836968Z 2025-05-07T20:32:42.7837389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.7837910Z 2025-05-07T20:32:43.0805351Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.0806214Z self=, 2025-05-07T20:32:43.0806782Z T=16384, 2025-05-07T20:32:43.0807049Z D=5120, 2025-05-07T20:32:43.0807278Z scale_ub=None, 2025-05-07T20:32:43.0807511Z contiguous=False, 2025-05-07T20:32:43.0807886Z compiled=True, 2025-05-07T20:32:43.0808105Z ) 2025-05-07T20:32:43.0808429Z self = 2025-05-07T20:32:43.0808942Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.0809223Z 2025-05-07T20:32:43.0809313Z @given( 2025-05-07T20:32:43.0809557Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.0810169Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.0810494Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.0810838Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.0811176Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.0811474Z ) 2025-05-07T20:32:43.0811836Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.0812289Z def test_silu_mul_quant( 2025-05-07T20:32:43.0812546Z self, 2025-05-07T20:32:43.0812753Z T: int, 2025-05-07T20:32:43.0812958Z D: int, 2025-05-07T20:32:43.0813190Z scale_ub: Optional[float], 2025-05-07T20:32:43.0813473Z contiguous: bool, 2025-05-07T20:32:43.0813723Z compiled: bool, 2025-05-07T20:32:43.0813964Z ) -> None: 2025-05-07T20:32:43.0814196Z torch.manual_seed(2025) 2025-05-07T20:32:43.0814446Z 2025-05-07T20:32:43.0814741Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.0815099Z 2025-05-07T20:32:43.0815312Z x_sign = torch.sign(x) 2025-05-07T20:32:43.0815609Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.0815938Z x = x_sign * x_clamp 2025-05-07T20:32:43.0816195Z x0 = x[:, :D] 2025-05-07T20:32:43.0816420Z x1 = x[:, D:] 2025-05-07T20:32:43.0816643Z 2025-05-07T20:32:43.0816854Z if contiguous: 2025-05-07T20:32:43.0817179Z x0 = x0.contiguous() 2025-05-07T20:32:43.0817460Z x1 = x1.contiguous() 2025-05-07T20:32:43.0817716Z 2025-05-07T20:32:43.0817917Z if scale_ub is not None: 2025-05-07T20:32:43.0818207Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.0818558Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.0818874Z ) 2025-05-07T20:32:43.0819084Z else: 2025-05-07T20:32:43.0819313Z scale_ub_tensor = None 2025-05-07T20:32:43.0819575Z 2025-05-07T20:32:43.0819815Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.0820143Z op = silu_mul_quant 2025-05-07T20:32:43.0820409Z if compiled: 2025-05-07T20:32:43.0820667Z op = torch.compile(op) 2025-05-07T20:32:43.0820977Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.0821270Z 2025-05-07T20:32:43.0821471Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.0821647Z 2025-05-07T20:32:43.0821755Z moe/activation_test.py:117: 2025-05-07T20:32:43.0822144Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.0822484Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.0822780Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.0823352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.0823925Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.0824671Z [... kernel-launch and Triton compiler frames identical to the traceback above, elided ...]
2025-05-07T20:32:43.0836809Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:43.0837177Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:43.0837443Z E   ^
2025-05-07T20:32:43.0838059Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:43.0838956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
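[Note: this CompilationError is an architecture mismatch, not a defect in the kernel source. The job ran on a g5.4xlarge runner, whose NVIDIA A10G reports CUDA compute capability (8, 6), while Triton accepts the fp8e4nv (float8_e4m3fn) element type only on compute capability 8.9 or newer; hence the error's hint that only 'fp8e4b15' and 'fp8e5' are available here. Below is a minimal, hypothetical capability gate (the helper name and skip message are illustrative, not taken from activation_test.py) that would skip the FP8 cases on such GPUs instead of erroring:

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # fp8e4nv compiles only on compute capability >= (8, 9) (Ada/Hopper);
        # the A10G on this runner reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
    class SiluMulQuantFP8Test(unittest.TestCase):
        ...

Applied to the test class (or per test method via the same decorator), this would turn the repeated hard failures below into skips on pre-Ada hardware.]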
2025-05-07T20:32:43.1583689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.1584393Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.1584936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1585626Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1586293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1586833Z kernel = self.compile( 2025-05-07T20:32:43.1587380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1588093Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1588494Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1588777Z 2025-05-07T20:32:43.1588988Z self = 2025-05-07T20:32:43.1590078Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1591473Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e4f7240>} 2025-05-07T20:32:43.1592861Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1593893Z context = 2025-05-07T20:32:43.1594188Z 2025-05-07T20:32:43.1594360Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1594886Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1595350Z module_map=module_map) 2025-05-07T20:32:43.1595717Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1596077Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.1596338Z E ^ 2025-05-07T20:32:43.1596854Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1597309Z 2025-05-07T20:32:43.1597727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1598239Z 2025-05-07T20:32:43.1598355Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1598766Z self=, 2025-05-07T20:32:43.1599177Z T=2048, 2025-05-07T20:32:43.1599379Z D=5120, 2025-05-07T20:32:43.1599578Z scale_ub=1200.0, 2025-05-07T20:32:43.1599813Z contiguous=False, 2025-05-07T20:32:43.1600049Z compiled=True, 2025-05-07T20:32:43.1600256Z ) 2025-05-07T20:32:43.1600585Z self = 2025-05-07T20:32:43.1601086Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.1601362Z 2025-05-07T20:32:43.1601454Z @given( 2025-05-07T20:32:43.1601687Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1602007Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1602320Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1602648Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1602981Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1603277Z ) 2025-05-07T20:32:43.1603680Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1604130Z def test_silu_mul_quant( 2025-05-07T20:32:43.1604379Z self, 2025-05-07T20:32:43.1604580Z T: int, 2025-05-07T20:32:43.1604781Z D: int, 2025-05-07T20:32:43.1605007Z scale_ub: Optional[float], 2025-05-07T20:32:43.1605281Z contiguous: bool, 2025-05-07T20:32:43.1605521Z compiled: bool, 2025-05-07T20:32:43.1606058Z ) -> None: 2025-05-07T20:32:43.1606281Z torch.manual_seed(2025) 2025-05-07T20:32:43.1606527Z 2025-05-07T20:32:43.1606806Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1607149Z 2025-05-07T20:32:43.1607345Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1607702Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1608008Z x = x_sign * x_clamp 2025-05-07T20:32:43.1608247Z x0 = x[:, :D] 2025-05-07T20:32:43.1608463Z x1 = x[:, D:] 2025-05-07T20:32:43.1608673Z 2025-05-07T20:32:43.1608930Z if contiguous: 2025-05-07T20:32:43.1609165Z x0 = x0.contiguous() 2025-05-07T20:32:43.1609420Z x1 = x1.contiguous() 2025-05-07T20:32:43.1609652Z 2025-05-07T20:32:43.1609847Z if scale_ub is not None: 2025-05-07T20:32:43.1610116Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1610448Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1610747Z ) 2025-05-07T20:32:43.1611007Z else: 2025-05-07T20:32:43.1611216Z scale_ub_tensor = None 2025-05-07T20:32:43.1611461Z 2025-05-07T20:32:43.1611692Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1612007Z op = silu_mul_quant 2025-05-07T20:32:43.1612255Z if compiled: 2025-05-07T20:32:43.1612503Z op = torch.compile(op) 2025-05-07T20:32:43.1612798Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1613067Z 2025-05-07T20:32:43.1613264Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.1613430Z 2025-05-07T20:32:43.1613535Z moe/activation_test.py:117: 2025-05-07T20:32:43.1613829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1614164Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.1614442Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1615002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.1615625Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.1616282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.1616969Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.1617499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1618181Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1618842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1619380Z kernel = self.compile( 2025-05-07T20:32:43.1619919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1620580Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1620988Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1621217Z 2025-05-07T20:32:43.1621433Z self = 2025-05-07T20:32:43.1622507Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1623953Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e50c720>} 2025-05-07T20:32:43.1625299Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1626325Z context = 2025-05-07T20:32:43.1626617Z 2025-05-07T20:32:43.1626788Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1627308Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1627776Z module_map=module_map) 2025-05-07T20:32:43.1628144Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1628499Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.1628764Z E ^ 2025-05-07T20:32:43.1629283Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1629734Z 2025-05-07T20:32:43.1630157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1630666Z 2025-05-07T20:32:43.2961998Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2963190Z self=, 2025-05-07T20:32:43.2964605Z T=4096, 2025-05-07T20:32:43.2964992Z D=5120, 2025-05-07T20:32:43.2965385Z scale_ub=1200.0, 2025-05-07T20:32:43.2965825Z contiguous=True, 2025-05-07T20:32:43.2966270Z compiled=True, 2025-05-07T20:32:43.2966682Z ) 2025-05-07T20:32:43.2967309Z self = 2025-05-07T20:32:43.2968073Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.2968344Z 2025-05-07T20:32:43.2968436Z @given( 2025-05-07T20:32:43.2968672Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2968980Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2969295Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2969628Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2969956Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2970249Z ) 2025-05-07T20:32:43.2970702Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2971137Z def test_silu_mul_quant( 2025-05-07T20:32:43.2971390Z self, 2025-05-07T20:32:43.2971598Z T: int, 2025-05-07T20:32:43.2971794Z D: int, 2025-05-07T20:32:43.2972020Z scale_ub: Optional[float], 2025-05-07T20:32:43.2972299Z contiguous: bool, 2025-05-07T20:32:43.2972543Z compiled: bool, 2025-05-07T20:32:43.2972768Z ) -> None: 2025-05-07T20:32:43.2972994Z torch.manual_seed(2025) 2025-05-07T20:32:43.2973246Z 2025-05-07T20:32:43.2973520Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2973867Z 2025-05-07T20:32:43.2974072Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2974366Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2974683Z x = x_sign * x_clamp 2025-05-07T20:32:43.2974933Z x0 = x[:, :D] 2025-05-07T20:32:43.2975153Z x1 = x[:, D:] 2025-05-07T20:32:43.2975371Z 2025-05-07T20:32:43.2975566Z if contiguous: 2025-05-07T20:32:43.2975799Z x0 = x0.contiguous() 2025-05-07T20:32:43.2976067Z x1 = x1.contiguous() 2025-05-07T20:32:43.2976313Z 2025-05-07T20:32:43.2976505Z if scale_ub is not None: 2025-05-07T20:32:43.2976782Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2977122Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2977434Z ) 2025-05-07T20:32:43.2977725Z else: 2025-05-07T20:32:43.2977991Z scale_ub_tensor = None 2025-05-07T20:32:43.2978248Z 2025-05-07T20:32:43.2978478Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2978800Z op = silu_mul_quant 2025-05-07T20:32:43.2979050Z if compiled: 2025-05-07T20:32:43.2979297Z op = torch.compile(op) 2025-05-07T20:32:43.2979597Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2979879Z 2025-05-07T20:32:43.2980080Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2980245Z 2025-05-07T20:32:43.2980346Z moe/activation_test.py:117: 2025-05-07T20:32:43.2980643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2980980Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2981257Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2981821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2982465Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2983125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2983817Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2984353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2985103Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2985764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2986300Z kernel = self.compile( 2025-05-07T20:32:43.2986849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2987512Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2987913Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2988148Z 2025-05-07T20:32:43.2988357Z self = 2025-05-07T20:32:43.2989435Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2990869Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e50d260>} 2025-05-07T20:32:43.2992207Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2993235Z context = 2025-05-07T20:32:43.2993531Z 2025-05-07T20:32:43.2993696Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2994224Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2994686Z module_map=module_map) 2025-05-07T20:32:43.2995056Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2995417Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2995682Z E ^ 2025-05-07T20:32:43.2996153Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2996609Z 2025-05-07T20:32:43.2997023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2997533Z 2025-05-07T20:32:43.2997648Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2998105Z self=, 2025-05-07T20:32:43.2998512Z T=128, 2025-05-07T20:32:43.2998703Z D=5120, 2025-05-07T20:32:43.2998918Z scale_ub=1200.0, 2025-05-07T20:32:43.2999143Z contiguous=False, 2025-05-07T20:32:43.2999377Z compiled=True, 2025-05-07T20:32:43.2999588Z ) 2025-05-07T20:32:43.5480684Z self = 2025-05-07T20:32:43.5481442Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.5481903Z 2025-05-07T20:32:43.5482016Z @given( 2025-05-07T20:32:43.5482345Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.5482737Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.5483052Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.5483387Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.5483715Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.5484005Z ) 2025-05-07T20:32:43.5484653Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.5485096Z def test_silu_mul_quant( 2025-05-07T20:32:43.5485345Z self, 2025-05-07T20:32:43.5485546Z T: int, 2025-05-07T20:32:43.5485750Z D: int, 2025-05-07T20:32:43.5485966Z scale_ub: Optional[float], 2025-05-07T20:32:43.5486242Z contiguous: bool, 2025-05-07T20:32:43.5486484Z compiled: bool, 2025-05-07T20:32:43.5486795Z ) -> None: 2025-05-07T20:32:43.5487020Z torch.manual_seed(2025) 2025-05-07T20:32:43.5487272Z 2025-05-07T20:32:43.5487622Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.5487982Z 2025-05-07T20:32:43.5488187Z x_sign = torch.sign(x) 2025-05-07T20:32:43.5488486Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.5488808Z x = x_sign * x_clamp 2025-05-07T20:32:43.5489059Z x0 = x[:, :D] 2025-05-07T20:32:43.5489292Z x1 = x[:, D:] 2025-05-07T20:32:43.5489535Z 2025-05-07T20:32:43.5489718Z if contiguous: 2025-05-07T20:32:43.5489956Z x0 = x0.contiguous() 2025-05-07T20:32:43.5490219Z x1 = x1.contiguous() 2025-05-07T20:32:43.5490459Z 2025-05-07T20:32:43.5490653Z if scale_ub is not None: 2025-05-07T20:32:43.5490929Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.5491269Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.5491666Z ) 2025-05-07T20:32:43.5491862Z else: 2025-05-07T20:32:43.5492080Z scale_ub_tensor = None 2025-05-07T20:32:43.5492333Z 2025-05-07T20:32:43.5492566Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.5492881Z op = silu_mul_quant 2025-05-07T20:32:43.5493132Z if compiled: 2025-05-07T20:32:43.5493385Z op = torch.compile(op) 2025-05-07T20:32:43.5493685Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5493959Z 2025-05-07T20:32:43.5494160Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.5494322Z 2025-05-07T20:32:43.5494433Z moe/activation_test.py:117: 2025-05-07T20:32:43.5494727Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5495067Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.5495351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5495914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.5496481Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.5497142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.5497860Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.5498416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.5499207Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.5499874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.5500409Z kernel = self.compile( 2025-05-07T20:32:43.5500946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.5501604Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.5502009Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5502235Z 2025-05-07T20:32:43.5502448Z self = 2025-05-07T20:32:43.5503522Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.5504964Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e50e480>} 2025-05-07T20:32:43.5506594Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.5507698Z context = 2025-05-07T20:32:43.5507986Z 2025-05-07T20:32:43.5508156Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.5508686Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.5509161Z module_map=module_map) 2025-05-07T20:32:43.5509542Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.5509908Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.5510184Z E ^ 2025-05-07T20:32:43.5510662Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.5511113Z 2025-05-07T20:32:43.5511541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.5512054Z 2025-05-07T20:32:43.5512164Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.5512658Z self=, 2025-05-07T20:32:43.5513078Z T=16384, 2025-05-07T20:32:43.5513282Z D=7168, 2025-05-07T20:32:43.5513489Z scale_ub=1200.0, 2025-05-07T20:32:43.5513726Z contiguous=True, 2025-05-07T20:32:43.5513953Z compiled=True, 2025-05-07T20:32:43.5514176Z ) 2025-05-07T20:32:43.5514506Z self = 2025-05-07T20:32:43.5515008Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.5515297Z 2025-05-07T20:32:43.5515380Z @given( 2025-05-07T20:32:43.5515623Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.5515947Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.5516256Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.5516599Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.5516939Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.5517236Z ) 2025-05-07T20:32:43.5517593Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.5518048Z def test_silu_mul_quant( 2025-05-07T20:32:43.5518294Z self, 2025-05-07T20:32:43.5518504Z T: int, 2025-05-07T20:32:43.5518713Z D: int, 2025-05-07T20:32:43.5518933Z scale_ub: Optional[float], 2025-05-07T20:32:43.5519213Z contiguous: bool, 2025-05-07T20:32:43.5519464Z compiled: bool, 2025-05-07T20:32:43.5519767Z ) -> None: 2025-05-07T20:32:43.5519993Z torch.manual_seed(2025) 2025-05-07T20:32:43.5520247Z 2025-05-07T20:32:43.5520529Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.5520868Z 2025-05-07T20:32:43.5521070Z x_sign = torch.sign(x) 2025-05-07T20:32:43.5521370Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.5521684Z x = x_sign * x_clamp 2025-05-07T20:32:43.5521942Z x0 = x[:, :D] 2025-05-07T20:32:43.5522170Z x1 = x[:, D:] 2025-05-07T20:32:43.5522381Z 2025-05-07T20:32:43.5522580Z if contiguous: 2025-05-07T20:32:43.5522822Z x0 = x0.contiguous() 2025-05-07T20:32:43.5523081Z x1 = x1.contiguous() 2025-05-07T20:32:43.5523330Z 2025-05-07T20:32:43.5523536Z if scale_ub is not None: 2025-05-07T20:32:43.5523812Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.5524165Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.5524551Z ) 2025-05-07T20:32:43.5524760Z else: 2025-05-07T20:32:43.5524976Z scale_ub_tensor = None 2025-05-07T20:32:43.5525239Z 2025-05-07T20:32:43.5525478Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.5525800Z op = silu_mul_quant 2025-05-07T20:32:43.5526051Z if compiled: 2025-05-07T20:32:43.5526308Z op = torch.compile(op) 2025-05-07T20:32:43.5526659Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5526935Z 2025-05-07T20:32:43.5527135Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.5527301Z 2025-05-07T20:32:43.5527410Z moe/activation_test.py:117: 2025-05-07T20:32:43.5527805Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5528168Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.5528457Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5529020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.5529586Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.5530254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.5530948Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.5531484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.5532223Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.5532891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.5533428Z kernel = self.compile( 2025-05-07T20:32:43.5533968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.5534626Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.5535032Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5535263Z 2025-05-07T20:32:43.5535473Z self = 2025-05-07T20:32:43.5536560Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.5537948Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e50fd80>} 2025-05-07T20:32:43.5539301Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.5540381Z context = 2025-05-07T20:32:43.5540673Z 2025-05-07T20:32:43.5540840Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.5541368Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.5541840Z module_map=module_map) 2025-05-07T20:32:43.5542213Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.5542573Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.5542838Z E ^ 2025-05-07T20:32:43.5543311Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.5543761Z 2025-05-07T20:32:43.5544181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.5544702Z 2025-05-07T20:32:43.6500391Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.6501307Z self=, 2025-05-07T20:32:43.6501881Z T=16384, 2025-05-07T20:32:43.6502122Z D=5120, 2025-05-07T20:32:43.6502317Z scale_ub=1200.0, 2025-05-07T20:32:43.6502545Z contiguous=True, 2025-05-07T20:32:43.6502772Z compiled=False, 2025-05-07T20:32:43.6502980Z ) 2025-05-07T20:32:43.6503303Z self = 2025-05-07T20:32:43.6503895Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.6504175Z 2025-05-07T20:32:43.6504258Z @given( 2025-05-07T20:32:43.6504495Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.6504811Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.6505128Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.6505457Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.6506042Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.6507293Z ) 2025-05-07T20:32:43.6507709Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.6508176Z def test_silu_mul_quant( 2025-05-07T20:32:43.6508436Z self, 2025-05-07T20:32:43.6508632Z T: int, 2025-05-07T20:32:43.6508840Z D: int, 2025-05-07T20:32:43.6509067Z scale_ub: Optional[float], 2025-05-07T20:32:43.6509343Z contiguous: bool, 2025-05-07T20:32:43.6509847Z compiled: bool, 2025-05-07T20:32:43.6510085Z ) -> None: 2025-05-07T20:32:43.6510305Z torch.manual_seed(2025) 2025-05-07T20:32:43.6510557Z 2025-05-07T20:32:43.6510843Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.6511186Z 2025-05-07T20:32:43.6511389Z x_sign = torch.sign(x) 2025-05-07T20:32:43.6511689Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.6512006Z x = x_sign * x_clamp 2025-05-07T20:32:43.6512261Z x0 = x[:, :D] 2025-05-07T20:32:43.6512482Z x1 = x[:, D:] 2025-05-07T20:32:43.6512698Z 2025-05-07T20:32:43.6512886Z if contiguous: 2025-05-07T20:32:43.6513126Z x0 = x0.contiguous() 2025-05-07T20:32:43.6513390Z x1 = x1.contiguous() 2025-05-07T20:32:43.6513628Z 2025-05-07T20:32:43.6513826Z if scale_ub is not None: 2025-05-07T20:32:43.6514106Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.6514447Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.6514762Z ) 2025-05-07T20:32:43.6514965Z else: 2025-05-07T20:32:43.6515181Z scale_ub_tensor = None 2025-05-07T20:32:43.6515445Z 2025-05-07T20:32:43.6515687Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.6516000Z op = silu_mul_quant 2025-05-07T20:32:43.6516259Z if compiled: 2025-05-07T20:32:43.6516514Z op = torch.compile(op) 2025-05-07T20:32:43.6516901Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.6517178Z 2025-05-07T20:32:43.6517379Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.6517544Z 2025-05-07T20:32:43.6517652Z moe/activation_test.py:117: 2025-05-07T20:32:43.6517948Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.6518289Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.6518579Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.6519279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:43.6519979Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.6520523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.6521217Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.6521962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.6522503Z kernel = self.compile( 2025-05-07T20:32:43.6523050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.6523711Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.6524106Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.6524411Z 2025-05-07T20:32:43.6524619Z self = 2025-05-07T20:32:43.6525702Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.6527119Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e3f8cc0>} 2025-05-07T20:32:43.6528558Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.6529593Z context = 2025-05-07T20:32:43.6529890Z 2025-05-07T20:32:43.6530139Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.6530673Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.6531145Z module_map=module_map) 2025-05-07T20:32:43.6531523Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.6531889Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.6532153Z E ^ 2025-05-07T20:32:43.6532636Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.6533096Z 2025-05-07T20:32:43.6533514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.6534029Z 2025-05-07T20:32:43.6534142Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.6534617Z self=, 2025-05-07T20:32:43.6535032Z T=1, 2025-05-07T20:32:43.6535230Z D=7168, 2025-05-07T20:32:43.6535437Z scale_ub=1200.0, 2025-05-07T20:32:43.6535667Z contiguous=False, 2025-05-07T20:32:43.6535907Z compiled=False, 2025-05-07T20:32:43.6536128Z ) 2025-05-07T20:32:43.6536450Z self = 2025-05-07T20:32:43.6536950Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.6537229Z 2025-05-07T20:32:43.6537309Z @given( 2025-05-07T20:32:43.6537602Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.6537923Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.6538243Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.6538587Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.6538919Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.6539217Z ) 2025-05-07T20:32:43.6539582Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.6542619Z def test_silu_mul_quant( 2025-05-07T20:32:43.6542870Z self, 2025-05-07T20:32:43.6543076Z T: int, 2025-05-07T20:32:43.6543274Z D: int, 2025-05-07T20:32:43.6543500Z scale_ub: Optional[float], 2025-05-07T20:32:43.6543772Z contiguous: bool, 2025-05-07T20:32:43.6544008Z compiled: bool, 2025-05-07T20:32:43.6544235Z ) -> None: 2025-05-07T20:32:43.6544453Z torch.manual_seed(2025) 2025-05-07T20:32:43.6544707Z 2025-05-07T20:32:43.6545041Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.6545384Z 2025-05-07T20:32:43.6545575Z x_sign = torch.sign(x) 2025-05-07T20:32:43.6545868Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.6546176Z x = x_sign * x_clamp 2025-05-07T20:32:43.6546415Z x0 = x[:, :D] 2025-05-07T20:32:43.6546635Z x1 = x[:, D:] 2025-05-07T20:32:43.6546849Z 2025-05-07T20:32:43.6547042Z if contiguous: 2025-05-07T20:32:43.6547298Z x0 = x0.contiguous() 2025-05-07T20:32:43.6547558Z x1 = x1.contiguous() 2025-05-07T20:32:43.6547808Z 2025-05-07T20:32:43.6547998Z if scale_ub is not None: 2025-05-07T20:32:43.6548273Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.6548615Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.6548923Z ) 2025-05-07T20:32:43.6549125Z else: 2025-05-07T20:32:43.6549350Z scale_ub_tensor = None 2025-05-07T20:32:43.6549604Z 2025-05-07T20:32:43.6549831Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.6550147Z op = silu_mul_quant 2025-05-07T20:32:43.6550395Z if compiled: 2025-05-07T20:32:43.6550638Z op = torch.compile(op) 2025-05-07T20:32:43.6550930Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.6551208Z 2025-05-07T20:32:43.6551449Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.6551622Z 2025-05-07T20:32:43.6551723Z moe/activation_test.py:117: 2025-05-07T20:32:43.6552022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.6552357Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.6552642Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.6553336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.6554037Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.6554571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.6555257Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.6555921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.6556450Z kernel = self.compile( 2025-05-07T20:32:43.6557003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.6557670Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.6558102Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.6558351Z 2025-05-07T20:32:43.6558557Z self = 2025-05-07T20:32:43.6559691Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.6561062Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e3f9080>} 2025-05-07T20:32:43.6562409Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.6563515Z context = 2025-05-07T20:32:43.6563803Z 2025-05-07T20:32:43.6563967Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.6564497Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.6565007Z module_map=module_map) 2025-05-07T20:32:43.6565366Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.6565728Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.6565995Z E ^ 2025-05-07T20:32:43.6566464Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.6566910Z 2025-05-07T20:32:43.6567325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.6567920Z 2025-05-07T20:32:43.7903306Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.7903752Z self=, 2025-05-07T20:32:43.7904180Z T=4096, 2025-05-07T20:32:43.7904377Z D=7168, 2025-05-07T20:32:43.7904584Z scale_ub=1200.0, 2025-05-07T20:32:43.7904811Z contiguous=False, 2025-05-07T20:32:43.7905044Z compiled=True, 2025-05-07T20:32:43.7905251Z ) 2025-05-07T20:32:43.7905574Z self = 2025-05-07T20:32:43.7906238Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.7906511Z 2025-05-07T20:32:43.7906592Z @given( 2025-05-07T20:32:43.7906831Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.7907149Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.7907573Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.7907912Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.7908245Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.7908530Z ) 2025-05-07T20:32:43.7908886Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.7909331Z def test_silu_mul_quant( 2025-05-07T20:32:43.7909577Z self, 2025-05-07T20:32:43.7909769Z T: int, 2025-05-07T20:32:43.7909977Z D: int, 2025-05-07T20:32:43.7910199Z scale_ub: Optional[float], 2025-05-07T20:32:43.7910468Z contiguous: bool, 2025-05-07T20:32:43.7910710Z compiled: bool, 2025-05-07T20:32:43.7910937Z ) -> None: 2025-05-07T20:32:43.7911155Z torch.manual_seed(2025) 2025-05-07T20:32:43.7911404Z 2025-05-07T20:32:43.7911684Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.7912030Z 2025-05-07T20:32:43.7912231Z x_sign = torch.sign(x) 2025-05-07T20:32:43.7912531Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.7912838Z x = x_sign * x_clamp 2025-05-07T20:32:43.7913084Z x0 = x[:, :D] 2025-05-07T20:32:43.7913310Z x1 = x[:, D:] 2025-05-07T20:32:43.7913516Z 2025-05-07T20:32:43.7913715Z if contiguous: 2025-05-07T20:32:43.7913950Z x0 = x0.contiguous() 2025-05-07T20:32:43.7914214Z x1 = x1.contiguous() 2025-05-07T20:32:43.7914526Z 2025-05-07T20:32:43.7914729Z if scale_ub is not None: 2025-05-07T20:32:43.7915011Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.7915346Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.7915660Z ) 2025-05-07T20:32:43.7915858Z else: 2025-05-07T20:32:43.7916086Z scale_ub_tensor = None 2025-05-07T20:32:43.7916353Z 2025-05-07T20:32:43.7916591Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.7916916Z op = silu_mul_quant 2025-05-07T20:32:43.7917258Z if compiled: 2025-05-07T20:32:43.7925419Z op = torch.compile(op) 2025-05-07T20:32:43.7925769Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.7926052Z 2025-05-07T20:32:43.7926258Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.7926433Z 2025-05-07T20:32:43.7926540Z moe/activation_test.py:117: 2025-05-07T20:32:43.7926859Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.7927310Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.7927713Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.7928290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.7928871Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.7929541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.7930235Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.7930779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.7931473Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.7932151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.7932694Z kernel = self.compile( 2025-05-07T20:32:43.7933258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.7933927Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.7934333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.7934574Z 2025-05-07T20:32:43.7934786Z self = 2025-05-07T20:32:43.7935934Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.7937322Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e3fb060>} 2025-05-07T20:32:43.7938682Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.7939714Z context = 2025-05-07T20:32:43.7940018Z 2025-05-07T20:32:43.7940190Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.7940719Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.7941210Z module_map=module_map) 2025-05-07T20:32:43.7941581Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.7941952Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.7942223Z E ^ 2025-05-07T20:32:43.7942696Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.7943145Z 2025-05-07T20:32:43.7943611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.7944135Z 2025-05-07T20:32:43.7944246Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.7944671Z self=, 2025-05-07T20:32:43.7945086Z T=128, 2025-05-07T20:32:43.7945282Z D=7168, 2025-05-07T20:32:43.7945490Z scale_ub=1200.0, 2025-05-07T20:32:43.7945733Z contiguous=False, 2025-05-07T20:32:43.7945966Z compiled=True, 2025-05-07T20:32:43.7946235Z ) 2025-05-07T20:32:43.8655529Z self = 2025-05-07T20:32:43.8656038Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.8656327Z 2025-05-07T20:32:43.8656418Z @given( 2025-05-07T20:32:43.8656664Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.8657027Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.8657453Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.8657989Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.8658643Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.8659215Z ) 2025-05-07T20:32:43.8659916Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.8660794Z def test_silu_mul_quant( 2025-05-07T20:32:43.8661288Z self, 2025-05-07T20:32:43.8661680Z T: int, 2025-05-07T20:32:43.8662069Z D: int, 2025-05-07T20:32:43.8662512Z scale_ub: Optional[float], 2025-05-07T20:32:43.8663054Z contiguous: bool, 2025-05-07T20:32:43.8663529Z compiled: bool, 2025-05-07T20:32:43.8663983Z ) -> None: 2025-05-07T20:32:43.8664419Z torch.manual_seed(2025) 2025-05-07T20:32:43.8664898Z 2025-05-07T20:32:43.8665450Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.8666146Z 2025-05-07T20:32:43.8666530Z x_sign = torch.sign(x) 2025-05-07T20:32:43.8667122Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.8667730Z x = x_sign * x_clamp 2025-05-07T20:32:43.8667999Z x0 = x[:, :D] 2025-05-07T20:32:43.8668238Z x1 = x[:, D:] 2025-05-07T20:32:43.8668454Z 2025-05-07T20:32:43.8668648Z if contiguous: 2025-05-07T20:32:43.8668886Z x0 = x0.contiguous() 2025-05-07T20:32:43.8669226Z x1 = x1.contiguous() 2025-05-07T20:32:43.8669473Z 2025-05-07T20:32:43.8669668Z if scale_ub is not None: 2025-05-07T20:32:43.8669943Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.8670286Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.8670596Z ) 2025-05-07T20:32:43.8670801Z else: 2025-05-07T20:32:43.8671024Z scale_ub_tensor = None 2025-05-07T20:32:43.8671281Z 2025-05-07T20:32:43.8671524Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.8671854Z op = silu_mul_quant 2025-05-07T20:32:43.8672105Z if compiled: 2025-05-07T20:32:43.8672365Z op = torch.compile(op) 2025-05-07T20:32:43.8672669Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8672959Z 2025-05-07T20:32:43.8673154Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.8673330Z 2025-05-07T20:32:43.8673435Z moe/activation_test.py:117: 2025-05-07T20:32:43.8673746Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8674080Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.8674371Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8674943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.8675503Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.8676240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.8676937Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.8677484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.8678166Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.8678836Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.8679382Z kernel = self.compile( 2025-05-07T20:32:43.8679995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.8680648Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.8681050Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8681284Z 2025-05-07T20:32:43.8681502Z self = 2025-05-07T20:32:43.8682623Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.8683996Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e0c0360>} 2025-05-07T20:32:43.8685340Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.8686366Z context = 2025-05-07T20:32:43.8686652Z 2025-05-07T20:32:43.8686825Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.8687346Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.8687928Z module_map=module_map) 2025-05-07T20:32:43.8688342Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.8688704Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.8688971Z E ^ 2025-05-07T20:32:43.8689438Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.8689938Z 2025-05-07T20:32:43.8690362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.8690872Z 2025-05-07T20:32:43.8690977Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.8691396Z self=, 2025-05-07T20:32:43.8691807Z T=2048, 2025-05-07T20:32:43.8692008Z D=7168, 2025-05-07T20:32:43.8692202Z scale_ub=None, 2025-05-07T20:32:43.8692430Z contiguous=True, 2025-05-07T20:32:43.8692660Z compiled=True, 2025-05-07T20:32:43.8692860Z ) 2025-05-07T20:32:43.8693195Z self = 2025-05-07T20:32:43.8693697Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.8693966Z 2025-05-07T20:32:43.8694047Z @given( 2025-05-07T20:32:43.8694290Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.8694616Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.8694928Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.8695265Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.8695601Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.8695897Z ) 2025-05-07T20:32:43.8696254Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.8696697Z def test_silu_mul_quant( 2025-05-07T20:32:43.8696995Z self, 2025-05-07T20:32:43.8697207Z T: int, 2025-05-07T20:32:43.8697405Z D: int, 2025-05-07T20:32:43.8697631Z scale_ub: Optional[float], 2025-05-07T20:32:43.8697908Z contiguous: bool, 2025-05-07T20:32:43.8698148Z compiled: bool, 2025-05-07T20:32:43.8698391Z ) -> None: 2025-05-07T20:32:43.8698642Z torch.manual_seed(2025) 2025-05-07T20:32:43.8698883Z 2025-05-07T20:32:43.8699161Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.8699511Z 2025-05-07T20:32:43.8699755Z x_sign = torch.sign(x) 2025-05-07T20:32:43.8700052Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.8700364Z x = x_sign * x_clamp 2025-05-07T20:32:43.8700602Z x0 = x[:, :D] 2025-05-07T20:32:43.8700825Z x1 = x[:, D:] 2025-05-07T20:32:43.8701043Z 2025-05-07T20:32:43.8701232Z if contiguous: 2025-05-07T20:32:43.8701473Z x0 = x0.contiguous() 2025-05-07T20:32:43.8701781Z x1 = x1.contiguous() 2025-05-07T20:32:43.8702026Z 2025-05-07T20:32:43.8702218Z if scale_ub is not None: 2025-05-07T20:32:43.8702497Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.8702836Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.8703144Z ) 2025-05-07T20:32:43.8703341Z else: 2025-05-07T20:32:43.8703558Z scale_ub_tensor = None 2025-05-07T20:32:43.8703813Z 2025-05-07T20:32:43.8704048Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.8704369Z op = silu_mul_quant 2025-05-07T20:32:43.8704616Z if compiled: 2025-05-07T20:32:43.8704873Z op = torch.compile(op) 2025-05-07T20:32:43.8705178Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8705453Z 2025-05-07T20:32:43.8705820Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.8705987Z 2025-05-07T20:32:43.8706099Z moe/activation_test.py:117: 2025-05-07T20:32:43.8706408Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8706738Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.8707026Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8707588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.8708145Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.8708813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.8709586Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.8710129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.8710810Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.8711477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.8712021Z kernel = self.compile( 2025-05-07T20:32:43.8712558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.8713214Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.8713613Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8713837Z 2025-05-07T20:32:43.8714053Z self = 2025-05-07T20:32:43.8715123Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.8716577Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e0c0ea0>} 2025-05-07T20:32:43.8717918Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.8718942Z context = 2025-05-07T20:32:43.8719227Z 2025-05-07T20:32:43.8719396Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.8719916Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.8720444Z module_map=module_map) 2025-05-07T20:32:43.8720805Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.8721155Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.8721417Z E ^ 2025-05-07T20:32:43.8721890Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.8722396Z 2025-05-07T20:32:43.8722817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.8723324Z 2025-05-07T20:32:43.9372203Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9372674Z self=, 2025-05-07T20:32:43.9373093Z T=16384, 2025-05-07T20:32:43.9373294Z D=5120, 2025-05-07T20:32:43.9373498Z scale_ub=None, 2025-05-07T20:32:43.9373719Z contiguous=False, 2025-05-07T20:32:43.9373952Z compiled=False, 2025-05-07T20:32:43.9374158Z ) 2025-05-07T20:32:43.9374480Z self = 2025-05-07T20:32:43.9374981Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.9375259Z 2025-05-07T20:32:43.9375341Z @given( 2025-05-07T20:32:43.9375578Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.9375899Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.9376208Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.9376536Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.9376865Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.9377156Z ) 2025-05-07T20:32:43.9377502Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.9378048Z def test_silu_mul_quant( 2025-05-07T20:32:43.9378298Z self, 2025-05-07T20:32:43.9378518Z T: int, 2025-05-07T20:32:43.9378745Z D: int, 2025-05-07T20:32:43.9378967Z scale_ub: Optional[float], 2025-05-07T20:32:43.9379232Z contiguous: bool, 2025-05-07T20:32:43.9379475Z compiled: bool, 2025-05-07T20:32:43.9379701Z ) -> None: 2025-05-07T20:32:43.9379911Z torch.manual_seed(2025) 2025-05-07T20:32:43.9380156Z 2025-05-07T20:32:43.9380435Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.9380782Z 2025-05-07T20:32:43.9380974Z x_sign = torch.sign(x) 2025-05-07T20:32:43.9381269Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.9383301Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
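The repeated CompilationError above is Triton rejecting the fp8e4nv dtype (PyTorch's float8_e4m3fn) on this GPU: fp8e4nv kernels are only generated for compute capability 8.9 or newer (Ada/Hopper), while the A10G on a g5.4xlarge runner reports 8.6, where Triton offers only fp8e4b15 and fp8e5. A minimal sketch of a capability guard that a suite could use to skip these cases on unsupported hardware; the helper name and the skip wiring are assumptions for illustration, not FBGEMM's actual code:

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (float8_e4m3fn) math requires SM 8.9+; the A10G in this
    # log reports SM 8.6, so Triton raises at kernel compile time.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
class ActivationFp8Tests(unittest.TestCase):
    ...  # fp8 test cases would go here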
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.9385183Z 2025-05-07T20:32:43.9385307Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.9385518Z 2025-05-07T20:32:43.9385700Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9386115Z self=, 2025-05-07T20:32:43.9386521Z T=4096, 2025-05-07T20:32:43.9386713Z D=7168, 2025-05-07T20:32:43.9386927Z scale_ub=1200.0, 2025-05-07T20:32:43.9387155Z contiguous=True, 2025-05-07T20:32:43.9387379Z compiled=True, 2025-05-07T20:32:43.9387583Z ) 2025-05-07T20:32:43.9387900Z self = 2025-05-07T20:32:43.9388397Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.9388742Z 2025-05-07T20:32:43.9388828Z @given( 2025-05-07T20:32:43.9389054Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.9389367Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.9389675Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.9390002Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.9390336Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.9390689Z ) 2025-05-07T20:32:43.9391042Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.9391480Z def test_silu_mul_quant( 2025-05-07T20:32:43.9391721Z self, 2025-05-07T20:32:43.9391921Z T: int, 2025-05-07T20:32:43.9392115Z D: int, 2025-05-07T20:32:43.9392336Z scale_ub: Optional[float], 2025-05-07T20:32:43.9392611Z contiguous: bool, 2025-05-07T20:32:43.9392849Z compiled: bool, 2025-05-07T20:32:43.9393080Z ) -> None: 2025-05-07T20:32:43.9393300Z torch.manual_seed(2025) 2025-05-07T20:32:43.9393543Z 2025-05-07T20:32:43.9393818Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.9394166Z 2025-05-07T20:32:43.9394357Z x_sign = torch.sign(x) 2025-05-07T20:32:43.9394653Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.9396672Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.9398600Z 2025-05-07T20:32:43.9398720Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.9398931Z 2025-05-07T20:32:43.9399039Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9399444Z self=, 2025-05-07T20:32:43.9399848Z T=16384, 2025-05-07T20:32:43.9400046Z D=7168, 2025-05-07T20:32:43.9400241Z scale_ub=None, 2025-05-07T20:32:43.9400462Z contiguous=False, 2025-05-07T20:32:43.9400695Z compiled=False, 2025-05-07T20:32:43.9400897Z ) 2025-05-07T20:32:43.9401213Z self = 2025-05-07T20:32:43.9401709Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.9401988Z 2025-05-07T20:32:43.9402069Z @given( 2025-05-07T20:32:43.9402295Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.9402611Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.9402918Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.9403253Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.9403578Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.9403866Z ) 2025-05-07T20:32:43.9404216Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.9404652Z def test_silu_mul_quant( 2025-05-07T20:32:43.9404896Z self, 2025-05-07T20:32:43.9405195Z T: int, 2025-05-07T20:32:43.9405392Z D: int, 2025-05-07T20:32:43.9405766Z scale_ub: Optional[float], 2025-05-07T20:32:43.9406043Z contiguous: bool, 2025-05-07T20:32:43.9406278Z compiled: bool, 2025-05-07T20:32:43.9406501Z ) -> None: 2025-05-07T20:32:43.9406714Z torch.manual_seed(2025) 2025-05-07T20:32:43.9406954Z 2025-05-07T20:32:43.9407224Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.9409432Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
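The 448.00 MiB request above matches the tensor being allocated exactly: x = torch.randn([T, 2 * D]) with T=16384, D=7168 in bfloat16 is 16384 x 14336 elements at 2 bytes each, i.e. 469,762,048 bytes = 448 MiB, well beyond the 140.44 MiB the allocator has left. A quick sanity check of that arithmetic:

>>> T, D = 16384, 7168
>>> T * (2 * D) * 2 / 2**20  # elements times 2 bytes per bfloat16 value, in MiB
448.0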
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.9411364Z 2025-05-07T20:32:43.9411482Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.9411693Z 2025-05-07T20:32:43.9411801Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9412207Z self=, 2025-05-07T20:32:43.9412613Z T=2048, 2025-05-07T20:32:43.9412806Z D=7168, 2025-05-07T20:32:43.9413000Z scale_ub=1200.0, 2025-05-07T20:32:43.9413222Z contiguous=True, 2025-05-07T20:32:43.9413448Z compiled=True, 2025-05-07T20:32:43.9413659Z ) 2025-05-07T20:32:43.9413978Z self = 2025-05-07T20:32:43.9414476Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.9414744Z 2025-05-07T20:32:43.9414833Z @given( 2025-05-07T20:32:43.9415057Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.9415380Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.9415684Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.9416011Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.9416343Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.9416634Z ) 2025-05-07T20:32:43.9416983Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.9417420Z def test_silu_mul_quant( 2025-05-07T20:32:43.9417734Z self, 2025-05-07T20:32:43.9417932Z T: int, 2025-05-07T20:32:43.9418129Z D: int, 2025-05-07T20:32:43.9418358Z scale_ub: Optional[float], 2025-05-07T20:32:43.9418633Z contiguous: bool, 2025-05-07T20:32:43.9418870Z compiled: bool, 2025-05-07T20:32:43.9419095Z ) -> None: 2025-05-07T20:32:43.9419310Z torch.manual_seed(2025) 2025-05-07T20:32:43.9419549Z 2025-05-07T20:32:43.9419823Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.9420169Z 2025-05-07T20:32:43.9420361Z x_sign = torch.sign(x) 2025-05-07T20:32:43.9420653Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.9422638Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.9424490Z 2025-05-07T20:32:43.9424609Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.9424819Z 2025-05-07T20:32:43.9424926Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9425399Z self=, 2025-05-07T20:32:43.9425804Z T=2048, 2025-05-07T20:32:43.9425994Z D=7168, 2025-05-07T20:32:43.9426181Z scale_ub=None, 2025-05-07T20:32:43.9426397Z contiguous=True, 2025-05-07T20:32:43.9426621Z compiled=False, 2025-05-07T20:32:43.9426823Z ) 2025-05-07T20:32:44.0290042Z self = 2025-05-07T20:32:44.0290594Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.0290966Z 2025-05-07T20:32:44.0291055Z @given( 2025-05-07T20:32:44.0291280Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.0291592Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.0291894Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.0292219Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.0292545Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.0292839Z ) 2025-05-07T20:32:44.0293251Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.0293690Z def test_silu_mul_quant( 2025-05-07T20:32:44.0293933Z self, 2025-05-07T20:32:44.0294131Z T: int, 2025-05-07T20:32:44.0294323Z D: int, 2025-05-07T20:32:44.0294542Z scale_ub: Optional[float], 2025-05-07T20:32:44.0294818Z contiguous: bool, 2025-05-07T20:32:44.0295055Z compiled: bool, 2025-05-07T20:32:44.0295282Z ) -> None: 2025-05-07T20:32:44.0295504Z torch.manual_seed(2025) 2025-05-07T20:32:44.0295740Z 2025-05-07T20:32:44.0296012Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.0296360Z 2025-05-07T20:32:44.0296551Z > x_sign = torch.sign(x) 2025-05-07T20:32:44.0298528Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
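Every OOM record above ends with the allocator's standard hint about PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. The caching allocator reads that variable once, when CUDA is first initialized, so it must be in the environment before the first allocation, for example exported in the CI job step or set at the top of the test entrypoint. A minimal sketch, assuming a conftest.py that runs before anything touches CUDA (the placement is an assumption about how one might wire it into this suite):

# conftest.py (hypothetical placement; must execute before CUDA is initialized)
import os

# Let the allocator grow existing segments instead of carving fixed-size
# blocks, which is the fragmentation mitigation the OOM messages suggest.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # noqa: E402  (imported only after the env var is set)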
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.0300379Z 2025-05-07T20:32:44.0307045Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:44.0307443Z 2025-05-07T20:32:44.0307557Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.0307980Z self=, 2025-05-07T20:32:44.0308391Z T=1, 2025-05-07T20:32:44.0308578Z D=7168, 2025-05-07T20:32:44.0308781Z scale_ub=1200.0, 2025-05-07T20:32:44.0309015Z contiguous=True, 2025-05-07T20:32:44.0309241Z compiled=False, 2025-05-07T20:32:44.0309453Z ) 2025-05-07T20:32:44.0309785Z self = 2025-05-07T20:32:44.0310286Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.0310557Z 2025-05-07T20:32:44.0310638Z @given( 2025-05-07T20:32:44.0310877Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.0311195Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.0311502Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.0311845Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.0312180Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.0312477Z ) 2025-05-07T20:32:44.0312837Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.0313286Z def test_silu_mul_quant( 2025-05-07T20:32:44.0313530Z self, 2025-05-07T20:32:44.0313730Z T: int, 2025-05-07T20:32:44.0313937Z D: int, 2025-05-07T20:32:44.0314231Z scale_ub: Optional[float], 2025-05-07T20:32:44.0314507Z contiguous: bool, 2025-05-07T20:32:44.0314757Z compiled: bool, 2025-05-07T20:32:44.0314993Z ) -> None: 2025-05-07T20:32:44.0315211Z torch.manual_seed(2025) 2025-05-07T20:32:44.0315457Z 2025-05-07T20:32:44.0315736Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.0316078Z 2025-05-07T20:32:44.0316277Z x_sign = torch.sign(x) 2025-05-07T20:32:44.0316580Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.0316967Z x = x_sign * x_clamp 2025-05-07T20:32:44.0317209Z x0 = x[:, :D] 2025-05-07T20:32:44.0317432Z x1 = x[:, D:] 2025-05-07T20:32:44.0317647Z 2025-05-07T20:32:44.0317837Z if contiguous: 2025-05-07T20:32:44.0318082Z x0 = x0.contiguous() 2025-05-07T20:32:44.0318346Z x1 = x1.contiguous() 2025-05-07T20:32:44.0318586Z 2025-05-07T20:32:44.0318791Z if scale_ub is not None: 2025-05-07T20:32:44.0319137Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.0319483Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.0319799Z ) 2025-05-07T20:32:44.0320006Z else: 2025-05-07T20:32:44.0320227Z scale_ub_tensor = None 2025-05-07T20:32:44.0320480Z 2025-05-07T20:32:44.0320718Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.0321042Z op = silu_mul_quant 2025-05-07T20:32:44.0321298Z if compiled: 2025-05-07T20:32:44.0321559Z op = torch.compile(op) 2025-05-07T20:32:44.0321861Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.0322137Z 2025-05-07T20:32:44.0322335Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.0322501Z 2025-05-07T20:32:44.0322606Z moe/activation_test.py:117: 2025-05-07T20:32:44.0322904Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.0323243Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.0323533Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.0324236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.0324929Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.0325469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.0326155Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.0326868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.0327405Z kernel = self.compile( 2025-05-07T20:32:44.0328037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.0328741Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.0329136Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.0329366Z 2025-05-07T20:32:44.0329572Z self = 2025-05-07T20:32:44.0330645Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.0332019Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e230680>} 2025-05-07T20:32:44.0333352Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.0334420Z context = 2025-05-07T20:32:44.0334712Z 2025-05-07T20:32:44.0334882Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.0335403Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.0335862Z module_map=module_map) 2025-05-07T20:32:44.0336230Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.0336581Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.0336846Z E ^ 2025-05-07T20:32:44.0337353Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.0337807Z 2025-05-07T20:32:44.0338221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.0338763Z 2025-05-07T20:32:44.0338887Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.0339345Z self=, 2025-05-07T20:32:44.0339749Z T=128, 2025-05-07T20:32:44.0339945Z D=5120, 2025-05-07T20:32:44.0340144Z scale_ub=None, 2025-05-07T20:32:44.0340358Z contiguous=True, 2025-05-07T20:32:44.0340585Z compiled=False, 2025-05-07T20:32:44.0340791Z ) 2025-05-07T20:32:44.2555644Z self = 2025-05-07T20:32:44.2556645Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.2557167Z 2025-05-07T20:32:44.2557317Z @given( 2025-05-07T20:32:44.2557742Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2558308Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2558751Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2559083Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2559417Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2559701Z ) 2025-05-07T20:32:44.2560059Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2560503Z def test_silu_mul_quant( 2025-05-07T20:32:44.2560746Z self, 2025-05-07T20:32:44.2560943Z T: int, 2025-05-07T20:32:44.2561143Z D: int, 2025-05-07T20:32:44.2561358Z scale_ub: Optional[float], 2025-05-07T20:32:44.2561633Z contiguous: bool, 2025-05-07T20:32:44.2561873Z compiled: bool, 2025-05-07T20:32:44.2562221Z ) -> None: 2025-05-07T20:32:44.2562440Z torch.manual_seed(2025) 2025-05-07T20:32:44.2562685Z 2025-05-07T20:32:44.2562956Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2563302Z 2025-05-07T20:32:44.2563500Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2563792Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2564094Z x = x_sign * x_clamp 2025-05-07T20:32:44.2564334Z x0 = x[:, :D] 2025-05-07T20:32:44.2564558Z x1 = x[:, D:] 2025-05-07T20:32:44.2564768Z 2025-05-07T20:32:44.2564956Z if contiguous: 2025-05-07T20:32:44.2565188Z x0 = x0.contiguous() 2025-05-07T20:32:44.2565440Z x1 = x1.contiguous() 2025-05-07T20:32:44.2565689Z 2025-05-07T20:32:44.2565880Z if scale_ub is not None: 2025-05-07T20:32:44.2566148Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2566486Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2566797Z ) 2025-05-07T20:32:44.2566991Z else: 2025-05-07T20:32:44.2567211Z scale_ub_tensor = None 2025-05-07T20:32:44.2567465Z 2025-05-07T20:32:44.2567772Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2568089Z op = silu_mul_quant 2025-05-07T20:32:44.2568341Z if compiled: 2025-05-07T20:32:44.2568591Z op = torch.compile(op) 2025-05-07T20:32:44.2568885Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2569235Z 2025-05-07T20:32:44.2569433Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.2569600Z 2025-05-07T20:32:44.2569702Z moe/activation_test.py:117: 2025-05-07T20:32:44.2569996Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2570329Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.2570610Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2571294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.2572066Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.2572601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2573281Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2573938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2574528Z kernel = self.compile( 2025-05-07T20:32:44.2575066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2575714Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2576113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2576342Z 2025-05-07T20:32:44.2576550Z self = 2025-05-07T20:32:44.2577628Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2578995Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e2318a0>} 2025-05-07T20:32:44.2580325Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2581352Z context = 2025-05-07T20:32:44.2581635Z 2025-05-07T20:32:44.2581811Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2582379Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2582840Z module_map=module_map) 2025-05-07T20:32:44.2583208Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2583562Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.2583819Z E ^ 2025-05-07T20:32:44.2584281Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2584731Z 2025-05-07T20:32:44.2585147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2585657Z 2025-05-07T20:32:44.2585766Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2586177Z self=, 2025-05-07T20:32:44.2586578Z T=128, 2025-05-07T20:32:44.2586766Z D=7168, 2025-05-07T20:32:44.2586963Z scale_ub=None, 2025-05-07T20:32:44.2587175Z contiguous=True, 2025-05-07T20:32:44.2587397Z compiled=False, 2025-05-07T20:32:44.2587597Z ) 2025-05-07T20:32:44.2587935Z self = 2025-05-07T20:32:44.2588453Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.2588722Z 2025-05-07T20:32:44.2588809Z @given( 2025-05-07T20:32:44.2589038Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2589402Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2589711Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2590033Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2590361Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2590642Z ) 2025-05-07T20:32:44.2590987Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2591421Z def test_silu_mul_quant( 2025-05-07T20:32:44.2591689Z self, 2025-05-07T20:32:44.2591935Z T: int, 2025-05-07T20:32:44.2592130Z D: int, 2025-05-07T20:32:44.2592351Z scale_ub: Optional[float], 2025-05-07T20:32:44.2592619Z contiguous: bool, 2025-05-07T20:32:44.2592856Z compiled: bool, 2025-05-07T20:32:44.2593076Z ) -> None: 2025-05-07T20:32:44.2593288Z torch.manual_seed(2025) 2025-05-07T20:32:44.2593529Z 2025-05-07T20:32:44.2593796Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2594208Z 2025-05-07T20:32:44.2594404Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2594693Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2595003Z x = x_sign * x_clamp 2025-05-07T20:32:44.2595240Z x0 = x[:, :D] 2025-05-07T20:32:44.2595455Z x1 = x[:, D:] 2025-05-07T20:32:44.2595663Z 2025-05-07T20:32:44.2595849Z if contiguous: 2025-05-07T20:32:44.2596074Z x0 = x0.contiguous() 2025-05-07T20:32:44.2596339Z x1 = x1.contiguous() 2025-05-07T20:32:44.2596583Z 2025-05-07T20:32:44.2596772Z if scale_ub is not None: 2025-05-07T20:32:44.2597043Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2597375Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2597682Z ) 2025-05-07T20:32:44.2597875Z else: 2025-05-07T20:32:44.2598085Z scale_ub_tensor = None 2025-05-07T20:32:44.2598336Z 2025-05-07T20:32:44.2598570Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2598891Z op = silu_mul_quant 2025-05-07T20:32:44.2599150Z if compiled: 2025-05-07T20:32:44.2599397Z op = torch.compile(op) 2025-05-07T20:32:44.2599690Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2599965Z 2025-05-07T20:32:44.2600155Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.2600321Z 2025-05-07T20:32:44.2600474Z moe/activation_test.py:117: 2025-05-07T20:32:44.2600775Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2601107Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.2601387Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2602074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.2602764Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.2603305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2603986Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2604650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2605180Z kernel = self.compile( 2025-05-07T20:32:44.2605899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2606562Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2606957Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2607187Z 2025-05-07T20:32:44.2607391Z self = 2025-05-07T20:32:44.2608662Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2610039Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e2327a0>} 2025-05-07T20:32:44.2611391Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2612479Z context = 2025-05-07T20:32:44.2612765Z 2025-05-07T20:32:44.2612929Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2613452Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2613914Z module_map=module_map) 2025-05-07T20:32:44.2614335Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2614692Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.2614954Z E ^ 2025-05-07T20:32:44.2615419Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2615870Z 2025-05-07T20:32:44.2616281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2616793Z 2025-05-07T20:32:44.2616900Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2617311Z self=, 2025-05-07T20:32:44.2617710Z T=2048, 2025-05-07T20:32:44.2617894Z D=7168, 2025-05-07T20:32:44.2618087Z scale_ub=1200.0, 2025-05-07T20:32:44.2618307Z contiguous=True, 2025-05-07T20:32:44.2618523Z compiled=False, 2025-05-07T20:32:44.2618726Z ) 2025-05-07T20:32:44.3287835Z self = 2025-05-07T20:32:44.3288647Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.3288967Z 2025-05-07T20:32:44.3289046Z @given( 2025-05-07T20:32:44.3289276Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3289583Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3289885Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3290305Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3290635Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3290918Z ) 2025-05-07T20:32:44.3291261Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3291698Z def test_silu_mul_quant( 2025-05-07T20:32:44.3291940Z self, 2025-05-07T20:32:44.3292128Z T: int, 2025-05-07T20:32:44.3292327Z D: int, 2025-05-07T20:32:44.3292545Z scale_ub: Optional[float], 2025-05-07T20:32:44.3292811Z contiguous: bool, 2025-05-07T20:32:44.3293053Z compiled: bool, 2025-05-07T20:32:44.3293273Z ) -> None: 2025-05-07T20:32:44.3293486Z torch.manual_seed(2025) 2025-05-07T20:32:44.3293731Z 2025-05-07T20:32:44.3294001Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3296048Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3297905Z 2025-05-07T20:32:44.3298098Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3298324Z 2025-05-07T20:32:44.3298438Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3298893Z self=, 2025-05-07T20:32:44.3299465Z T=1, 2025-05-07T20:32:44.3299695Z D=5120, 2025-05-07T20:32:44.3299946Z scale_ub=1200.0, 2025-05-07T20:32:44.3300229Z contiguous=True, 2025-05-07T20:32:44.3300507Z compiled=False, 2025-05-07T20:32:44.3300754Z ) 2025-05-07T20:32:44.3301078Z self = 2025-05-07T20:32:44.3301656Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.3301925Z 2025-05-07T20:32:44.3302005Z @given( 2025-05-07T20:32:44.3302240Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3302554Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3302877Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3303264Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3303595Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3303876Z ) 2025-05-07T20:32:44.3304217Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3304656Z def test_silu_mul_quant( 2025-05-07T20:32:44.3304895Z self, 2025-05-07T20:32:44.3305088Z T: int, 2025-05-07T20:32:44.3305289Z D: int, 2025-05-07T20:32:44.3305504Z scale_ub: Optional[float], 2025-05-07T20:32:44.3306009Z contiguous: bool, 2025-05-07T20:32:44.3306252Z compiled: bool, 2025-05-07T20:32:44.3306479Z ) -> None: 2025-05-07T20:32:44.3306684Z torch.manual_seed(2025) 2025-05-07T20:32:44.3306925Z 2025-05-07T20:32:44.3307192Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3307534Z 2025-05-07T20:32:44.3307729Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3308053Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3308383Z x = x_sign * x_clamp 2025-05-07T20:32:44.3308620Z x0 = x[:, :D] 2025-05-07T20:32:44.3308834Z x1 = x[:, D:] 2025-05-07T20:32:44.3309041Z 2025-05-07T20:32:44.3309225Z if contiguous: 2025-05-07T20:32:44.3309457Z x0 = x0.contiguous() 2025-05-07T20:32:44.3309711Z x1 = x1.contiguous() 2025-05-07T20:32:44.3310021Z 2025-05-07T20:32:44.3310387Z if scale_ub is not None: 2025-05-07T20:32:44.3310661Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3310991Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3311299Z ) 2025-05-07T20:32:44.3311495Z else: 2025-05-07T20:32:44.3311702Z scale_ub_tensor = None 2025-05-07T20:32:44.3311953Z 2025-05-07T20:32:44.3312184Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3312492Z op = silu_mul_quant 2025-05-07T20:32:44.3312739Z if compiled: 2025-05-07T20:32:44.3312986Z op = torch.compile(op) 2025-05-07T20:32:44.3313282Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3313555Z 2025-05-07T20:32:44.3313750Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3313912Z 2025-05-07T20:32:44.3314013Z moe/activation_test.py:117: 2025-05-07T20:32:44.3314305Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3314638Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3314922Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3315605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3316293Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3316824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3317578Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3318285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3318814Z kernel = self.compile( 2025-05-07T20:32:44.3319354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3320004Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3320402Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3320702Z 2025-05-07T20:32:44.3320903Z self = 2025-05-07T20:32:44.3321977Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3323399Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e233b00>} 2025-05-07T20:32:44.3324734Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3325753Z context = 2025-05-07T20:32:44.3326064Z 2025-05-07T20:32:44.3326233Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3326751Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3327216Z module_map=module_map) 2025-05-07T20:32:44.3327640Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3327996Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3328262Z E ^ 2025-05-07T20:32:44.3328770Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3329227Z 2025-05-07T20:32:44.3329642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3330153Z 2025-05-07T20:32:44.3330258Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3330715Z self=, 2025-05-07T20:32:44.3331113Z T=2048, 2025-05-07T20:32:44.3331298Z D=5120, 2025-05-07T20:32:44.3331488Z scale_ub=None, 2025-05-07T20:32:44.3331698Z contiguous=True, 2025-05-07T20:32:44.3331925Z compiled=False, 2025-05-07T20:32:44.3332131Z ) 2025-05-07T20:32:44.3332446Z self = 2025-05-07T20:32:44.3332936Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3333202Z 2025-05-07T20:32:44.3333288Z @given( 2025-05-07T20:32:44.3333518Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3333826Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3334127Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3334455Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3334777Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3335066Z ) 2025-05-07T20:32:44.3335411Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3335841Z def test_silu_mul_quant( 2025-05-07T20:32:44.3336082Z self, 2025-05-07T20:32:44.3336275Z T: int, 2025-05-07T20:32:44.3336466Z D: int, 2025-05-07T20:32:44.3336683Z scale_ub: Optional[float], 2025-05-07T20:32:44.3342593Z contiguous: bool, 2025-05-07T20:32:44.3342852Z compiled: bool, 2025-05-07T20:32:44.3343150Z ) -> None: 2025-05-07T20:32:44.3343376Z torch.manual_seed(2025) 2025-05-07T20:32:44.3343626Z 2025-05-07T20:32:44.3343898Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3344248Z 2025-05-07T20:32:44.3344446Z > x_sign = torch.sign(x) 2025-05-07T20:32:44.3346396Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
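Note the trend across examples: free memory falls from 140.44 MiB to 28.44 MiB to 26.44 MiB while PyTorch's own allocation climbs toward 21.7 GiB, so later, even smaller, examples fail at the very first torch.randn. Because Hypothesis draws every example inside a single test invocation, a per-method tearDown never runs between examples; any cleanup would have to happen at the end of the test body itself. A hedged sketch of such a helper (its use here is an assumption, not something the suite currently does):

import gc

import torch

def release_cuda_memory() -> None:
    # Drop dead Python references first, then return the allocator's cached
    # blocks to the driver so the next Hypothesis example starts clean.
    gc.collect()
    torch.cuda.empty_cache()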
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3348349Z 2025-05-07T20:32:44.3348474Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:44.3348735Z 2025-05-07T20:32:44.3348844Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3349262Z self=, 2025-05-07T20:32:44.3349664Z T=16384, 2025-05-07T20:32:44.3349858Z D=5120, 2025-05-07T20:32:44.3350058Z scale_ub=None, 2025-05-07T20:32:44.3350275Z contiguous=True, 2025-05-07T20:32:44.3350502Z compiled=False, 2025-05-07T20:32:44.3350714Z ) 2025-05-07T20:32:44.4045700Z self = 2025-05-07T20:32:44.4046529Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.4046947Z 2025-05-07T20:32:44.4047062Z @given( 2025-05-07T20:32:44.4047395Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4047860Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4048184Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4048533Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4048880Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4049182Z ) 2025-05-07T20:32:44.4049535Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4049988Z def test_silu_mul_quant( 2025-05-07T20:32:44.4050246Z self, 2025-05-07T20:32:44.4050450Z T: int, 2025-05-07T20:32:44.4050664Z D: int, 2025-05-07T20:32:44.4051183Z scale_ub: Optional[float], 2025-05-07T20:32:44.4051459Z contiguous: bool, 2025-05-07T20:32:44.4051715Z compiled: bool, 2025-05-07T20:32:44.4051957Z ) -> None: 2025-05-07T20:32:44.4052179Z torch.manual_seed(2025) 2025-05-07T20:32:44.4052441Z 2025-05-07T20:32:44.4052734Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4054820Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.4056693Z 2025-05-07T20:32:44.4056825Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.4057044Z 2025-05-07T20:32:44.4057153Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4057575Z self=, 2025-05-07T20:32:44.4057988Z T=4096, 2025-05-07T20:32:44.4058182Z D=5120, 2025-05-07T20:32:44.4058397Z scale_ub=None, 2025-05-07T20:32:44.4058623Z contiguous=True, 2025-05-07T20:32:44.4058859Z compiled=False, 2025-05-07T20:32:44.4059686Z ) 2025-05-07T20:32:44.4060025Z self = 2025-05-07T20:32:44.4060532Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.4060802Z 2025-05-07T20:32:44.4060886Z @given( 2025-05-07T20:32:44.4061132Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4061460Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4061766Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4062114Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4062541Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4062831Z ) 2025-05-07T20:32:44.4063193Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4063654Z def test_silu_mul_quant( 2025-05-07T20:32:44.4063913Z self, 2025-05-07T20:32:44.4064115Z T: int, 2025-05-07T20:32:44.4064327Z D: int, 2025-05-07T20:32:44.4064644Z scale_ub: Optional[float], 2025-05-07T20:32:44.4064926Z contiguous: bool, 2025-05-07T20:32:44.4065182Z compiled: bool, 2025-05-07T20:32:44.4065419Z ) -> None: 2025-05-07T20:32:44.4065645Z torch.manual_seed(2025) 2025-05-07T20:32:44.4065905Z 2025-05-07T20:32:44.4066195Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4068248Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.4070113Z 2025-05-07T20:32:44.4070247Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.4070467Z 2025-05-07T20:32:44.4070577Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4071000Z self=, 2025-05-07T20:32:44.4071422Z T=2048, 2025-05-07T20:32:44.4071616Z D=5120, 2025-05-07T20:32:44.4071828Z scale_ub=None, 2025-05-07T20:32:44.4072067Z contiguous=False, 2025-05-07T20:32:44.4072348Z compiled=False, 2025-05-07T20:32:44.4072573Z ) 2025-05-07T20:32:44.4072911Z self = 2025-05-07T20:32:44.4073411Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.4073696Z 2025-05-07T20:32:44.4073783Z @given( 2025-05-07T20:32:44.4074029Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4074355Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4074672Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4075016Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4075360Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4075654Z ) 2025-05-07T20:32:44.4076017Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4076470Z def test_silu_mul_quant( 2025-05-07T20:32:44.4076718Z self, 2025-05-07T20:32:44.4076933Z T: int, 2025-05-07T20:32:44.4077144Z D: int, 2025-05-07T20:32:44.4077372Z scale_ub: Optional[float], 2025-05-07T20:32:44.4077654Z contiguous: bool, 2025-05-07T20:32:44.4077926Z compiled: bool, 2025-05-07T20:32:44.4078181Z ) -> None: 2025-05-07T20:32:44.4078412Z torch.manual_seed(2025) 2025-05-07T20:32:44.4078667Z 2025-05-07T20:32:44.4078948Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4081041Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.4082938Z 2025-05-07T20:32:44.4083057Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.4083279Z 2025-05-07T20:32:44.4083382Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4083796Z self=, 2025-05-07T20:32:44.4084200Z T=4096, 2025-05-07T20:32:44.4084395Z D=7168, 2025-05-07T20:32:44.4084592Z scale_ub=None, 2025-05-07T20:32:44.4084812Z contiguous=True, 2025-05-07T20:32:44.4085071Z compiled=True, 2025-05-07T20:32:44.4085280Z ) 2025-05-07T20:32:44.4085604Z self = 2025-05-07T20:32:44.4086089Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.4086369Z 2025-05-07T20:32:44.4086451Z @given( 2025-05-07T20:32:44.4086706Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4087028Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4087334Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4087741Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4088091Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4088383Z ) 2025-05-07T20:32:44.4088740Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4089188Z def test_silu_mul_quant( 2025-05-07T20:32:44.4089432Z self, 2025-05-07T20:32:44.4089633Z T: int, 2025-05-07T20:32:44.4089846Z D: int, 2025-05-07T20:32:44.4090078Z scale_ub: Optional[float], 2025-05-07T20:32:44.4090351Z contiguous: bool, 2025-05-07T20:32:44.4090600Z compiled: bool, 2025-05-07T20:32:44.4090834Z ) -> None: 2025-05-07T20:32:44.4091053Z torch.manual_seed(2025) 2025-05-07T20:32:44.4091305Z 2025-05-07T20:32:44.4091587Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4093689Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.4095545Z 2025-05-07T20:32:44.4095678Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.4095892Z 2025-05-07T20:32:44.4095998Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4096422Z self=, 2025-05-07T20:32:44.4096835Z T=2048, 2025-05-07T20:32:44.4097028Z D=5120, 2025-05-07T20:32:44.4097232Z scale_ub=1200.0, 2025-05-07T20:32:44.4097466Z contiguous=False, 2025-05-07T20:32:44.4097695Z compiled=False, 2025-05-07T20:32:44.4097911Z ) 2025-05-07T20:32:44.4098240Z self = 2025-05-07T20:32:44.4098739Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.4099015Z 2025-05-07T20:32:44.4099097Z @given( 2025-05-07T20:32:44.4099343Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4099713Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4100023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4100366Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4100701Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4100990Z ) 2025-05-07T20:32:44.4101347Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4101795Z def test_silu_mul_quant( 2025-05-07T20:32:44.4102052Z self, 2025-05-07T20:32:44.4102292Z T: int, 2025-05-07T20:32:44.4102501Z D: int, 2025-05-07T20:32:44.4102731Z scale_ub: Optional[float], 2025-05-07T20:32:44.4103005Z contiguous: bool, 2025-05-07T20:32:44.4103255Z compiled: bool, 2025-05-07T20:32:44.4103490Z ) -> None: 2025-05-07T20:32:44.4103710Z torch.manual_seed(2025) 2025-05-07T20:32:44.4103964Z 2025-05-07T20:32:44.4104246Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4106659Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.4108570Z 2025-05-07T20:32:44.4108697Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.4108922Z 2025-05-07T20:32:44.4109031Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4109454Z self=, 2025-05-07T20:32:44.4109859Z T=4096, 2025-05-07T20:32:44.4110050Z D=7168, 2025-05-07T20:32:44.4110255Z scale_ub=1200.0, 2025-05-07T20:32:44.4110492Z contiguous=True, 2025-05-07T20:32:44.4110717Z compiled=False, 2025-05-07T20:32:44.4110929Z ) 2025-05-07T20:32:44.5025471Z self = 2025-05-07T20:32:44.5026245Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.5026640Z 2025-05-07T20:32:44.5026752Z @given( 2025-05-07T20:32:44.5027075Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.5027630Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.5027978Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.5028360Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.5028734Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.5029059Z ) 2025-05-07T20:32:44.5029468Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.5030001Z def test_silu_mul_quant( 2025-05-07T20:32:44.5030271Z self, 2025-05-07T20:32:44.5030479Z T: int, 2025-05-07T20:32:44.5030685Z D: int, 2025-05-07T20:32:44.5030918Z scale_ub: Optional[float], 2025-05-07T20:32:44.5031224Z contiguous: bool, 2025-05-07T20:32:44.5031486Z compiled: bool, 2025-05-07T20:32:44.5031742Z ) -> None: 2025-05-07T20:32:44.5031970Z torch.manual_seed(2025) 2025-05-07T20:32:44.5032217Z 2025-05-07T20:32:44.5032491Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.5034658Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.5036557Z 2025-05-07T20:32:44.5036677Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.5036892Z 2025-05-07T20:32:44.5037008Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.5037426Z self=, 2025-05-07T20:32:44.5037847Z T=16384, 2025-05-07T20:32:44.5038055Z D=7168, 2025-05-07T20:32:44.5038340Z scale_ub=None, 2025-05-07T20:32:44.5038556Z contiguous=False, 2025-05-07T20:32:44.5038798Z compiled=True, 2025-05-07T20:32:44.5039018Z ) 2025-05-07T20:32:44.5039339Z self = 2025-05-07T20:32:44.5039851Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.5040133Z 2025-05-07T20:32:44.5040221Z @given( 2025-05-07T20:32:44.5040533Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.5040862Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.5041180Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.5041515Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.5041857Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.5042156Z ) 2025-05-07T20:32:44.5042517Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.5042963Z def test_silu_mul_quant( 2025-05-07T20:32:44.5043222Z self, 2025-05-07T20:32:44.5043429Z T: int, 2025-05-07T20:32:44.5043632Z D: int, 2025-05-07T20:32:44.5043860Z scale_ub: Optional[float], 2025-05-07T20:32:44.5044143Z contiguous: bool, 2025-05-07T20:32:44.5044390Z compiled: bool, 2025-05-07T20:32:44.5044621Z ) -> None: 2025-05-07T20:32:44.5044848Z torch.manual_seed(2025) 2025-05-07T20:32:44.5045093Z 2025-05-07T20:32:44.5045377Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.5047430Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.5049491Z 2025-05-07T20:32:44.5049617Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.5049834Z 2025-05-07T20:32:44.5049950Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.5050370Z self=, 2025-05-07T20:32:44.5050786Z T=4096, 2025-05-07T20:32:44.5050989Z D=7168, 2025-05-07T20:32:44.5051181Z scale_ub=None, 2025-05-07T20:32:44.5051405Z contiguous=True, 2025-05-07T20:32:44.5051637Z compiled=False, 2025-05-07T20:32:44.5051847Z ) 2025-05-07T20:32:44.5052169Z self = 2025-05-07T20:32:44.5052673Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.5052946Z 2025-05-07T20:32:44.5053045Z @given( 2025-05-07T20:32:44.5053280Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.5053607Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.5053926Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.5054258Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.5054601Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.5054895Z ) 2025-05-07T20:32:44.5055295Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.5055750Z def test_silu_mul_quant( 2025-05-07T20:32:44.5056009Z self, 2025-05-07T20:32:44.5056209Z T: int, 2025-05-07T20:32:44.5056413Z D: int, 2025-05-07T20:32:44.5056641Z scale_ub: Optional[float], 2025-05-07T20:32:44.5056917Z contiguous: bool, 2025-05-07T20:32:44.5057174Z compiled: bool, 2025-05-07T20:32:44.5057409Z ) -> None: 2025-05-07T20:32:44.5057634Z torch.manual_seed(2025) 2025-05-07T20:32:44.5057889Z 2025-05-07T20:32:44.5058225Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.5060325Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.5062181Z 2025-05-07T20:32:44.5062314Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.5062527Z 2025-05-07T20:32:44.5062633Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.5063059Z self=, 2025-05-07T20:32:44.5063484Z T=16384, 2025-05-07T20:32:44.5063694Z D=7168, 2025-05-07T20:32:44.5063893Z scale_ub=None, 2025-05-07T20:32:44.5064130Z contiguous=True, 2025-05-07T20:32:44.5064358Z compiled=False, 2025-05-07T20:32:44.5064574Z ) 2025-05-07T20:32:44.5064899Z self = 2025-05-07T20:32:44.5065396Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.5065686Z 2025-05-07T20:32:44.5065774Z @given( 2025-05-07T20:32:44.5066018Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.5066360Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.5066670Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.5067015Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.5067354Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.5067642Z ) 2025-05-07T20:32:44.5068050Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.5068499Z def test_silu_mul_quant( 2025-05-07T20:32:44.5068744Z self, 2025-05-07T20:32:44.5068954Z T: int, 2025-05-07T20:32:44.5069165Z D: int, 2025-05-07T20:32:44.5069386Z scale_ub: Optional[float], 2025-05-07T20:32:44.5069671Z contiguous: bool, 2025-05-07T20:32:44.5069925Z compiled: bool, 2025-05-07T20:32:44.5070150Z ) -> None: 2025-05-07T20:32:44.5070386Z torch.manual_seed(2025) 2025-05-07T20:32:44.5070645Z 2025-05-07T20:32:44.5070930Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.5072973Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.5074841Z 2025-05-07T20:32:44.5074968Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.5075193Z 2025-05-07T20:32:44.5075305Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.5075781Z self=, 2025-05-07T20:32:44.5076194Z T=16384, 2025-05-07T20:32:44.5076403Z D=7168, 2025-05-07T20:32:44.5076610Z scale_ub=1200.0, 2025-05-07T20:32:44.5076851Z contiguous=True, 2025-05-07T20:32:44.5077079Z compiled=False, 2025-05-07T20:32:44.5077297Z ) 2025-05-07T20:32:44.5077628Z self = 2025-05-07T20:32:44.5078126Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.5078462Z 2025-05-07T20:32:44.5078544Z @given( 2025-05-07T20:32:44.5078786Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.5079099Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.5079412Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.5079751Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.5080079Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.5080376Z ) 2025-05-07T20:32:44.5080778Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.5081225Z def test_silu_mul_quant( 2025-05-07T20:32:44.5081470Z self, 2025-05-07T20:32:44.5081680Z T: int, 2025-05-07T20:32:44.5081893Z D: int, 2025-05-07T20:32:44.5082115Z scale_ub: Optional[float], 2025-05-07T20:32:44.5082399Z contiguous: bool, 2025-05-07T20:32:44.5082650Z compiled: bool, 2025-05-07T20:32:44.5082881Z ) -> None: 2025-05-07T20:32:44.5083117Z torch.manual_seed(2025) 2025-05-07T20:32:44.5083373Z 2025-05-07T20:32:44.5083647Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.5085707Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
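All four examples above fail the same way: PyTorch already holds ~21.73 GiB before the example runs, leaving only 26.44 MiB free. The error text's own suggestion (expandable_segments) targets fragmentation, which looks secondary here (only ~19 MiB is reserved-but-unallocated); releasing memory between Hypothesis examples matters more. A minimal sketch of both mitigations, assuming nothing about the suite's actual fixtures (release_cuda_memory is an illustrative name):

    import os

    # Allocator setting suggested by the error message itself; it must be in
    # the environment before the first CUDA allocation in the process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import gc
    import torch

    def release_cuda_memory() -> None:
        # Drop dead Python references, then return the caching allocator's
        # blocks to the driver so the next example starts from a clean slate.
        gc.collect()
        torch.cuda.empty_cache()

Calling release_cuda_memory() from the test's tearDown (or at the top of the test body) would keep tensors from one generated example from starving the next.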
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.5087645Z 2025-05-07T20:32:44.5087774Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.5088020Z 2025-05-07T20:32:44.5088137Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.5088621Z self=, 2025-05-07T20:32:44.5089026Z T=128, 2025-05-07T20:32:44.5089229Z D=5120, 2025-05-07T20:32:44.5089436Z scale_ub=1200.0, 2025-05-07T20:32:44.5089666Z contiguous=False, 2025-05-07T20:32:44.5089907Z compiled=False, 2025-05-07T20:32:44.5090124Z ) 2025-05-07T20:32:44.6109880Z self = 2025-05-07T20:32:44.6110656Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.6111053Z 2025-05-07T20:32:44.6111166Z @given( 2025-05-07T20:32:44.6111495Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.6111922Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.6112245Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.6112589Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.6112921Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.6113236Z ) 2025-05-07T20:32:44.6113595Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.6114040Z def test_silu_mul_quant( 2025-05-07T20:32:44.6114295Z self, 2025-05-07T20:32:44.6114506Z T: int, 2025-05-07T20:32:44.6114736Z D: int, 2025-05-07T20:32:44.6114957Z scale_ub: Optional[float], 2025-05-07T20:32:44.6115242Z contiguous: bool, 2025-05-07T20:32:44.6115754Z compiled: bool, 2025-05-07T20:32:44.6116002Z ) -> None: 2025-05-07T20:32:44.6116226Z torch.manual_seed(2025) 2025-05-07T20:32:44.6116479Z 2025-05-07T20:32:44.6116764Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.6117113Z 2025-05-07T20:32:44.6117319Z x_sign = torch.sign(x) 2025-05-07T20:32:44.6125678Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.6126031Z x = x_sign * x_clamp 2025-05-07T20:32:44.6126305Z x0 = x[:, :D] 2025-05-07T20:32:44.6126722Z x1 = x[:, D:] 2025-05-07T20:32:44.6126939Z 2025-05-07T20:32:44.6127148Z if contiguous: 2025-05-07T20:32:44.6127400Z x0 = x0.contiguous() 2025-05-07T20:32:44.6127786Z x1 = x1.contiguous() 2025-05-07T20:32:44.6128055Z 2025-05-07T20:32:44.6128291Z if scale_ub is not None: 2025-05-07T20:32:44.6128580Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.6129004Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.6129333Z ) 2025-05-07T20:32:44.6129553Z else: 2025-05-07T20:32:44.6129774Z scale_ub_tensor = None 2025-05-07T20:32:44.6130044Z 2025-05-07T20:32:44.6130297Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.6130624Z op = silu_mul_quant 2025-05-07T20:32:44.6130893Z if compiled: 2025-05-07T20:32:44.6131165Z op = torch.compile(op) 2025-05-07T20:32:44.6131474Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.6131771Z 2025-05-07T20:32:44.6131990Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.6132166Z 2025-05-07T20:32:44.6132277Z moe/activation_test.py:117: 2025-05-07T20:32:44.6132604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.6132965Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.6133268Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.6133978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.6134692Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.6135246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:32:44.6135939Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.6136709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.6137253Z kernel = self.compile( 2025-05-07T20:32:44.6137813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.6138485Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.6138889Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.6139135Z 2025-05-07T20:32:44.6139348Z self = 2025-05-07T20:32:44.6140437Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.6141839Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5999d5a700>} 2025-05-07T20:32:44.6143205Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.6144235Z context = 2025-05-07T20:32:44.6144535Z 2025-05-07T20:32:44.6144754Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.6145290Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.6145772Z module_map=module_map) 2025-05-07T20:32:44.6146140Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.6146507Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.6146779Z E ^ 2025-05-07T20:32:44.6147242Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.6147748Z 2025-05-07T20:32:44.6148167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.6148689Z 2025-05-07T20:32:44.6148797Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.6149220Z self=, 2025-05-07T20:32:44.6149624Z T=2048, 2025-05-07T20:32:44.6149831Z D=7168, 2025-05-07T20:32:44.6150076Z scale_ub=None, 2025-05-07T20:32:44.6150292Z contiguous=False, 2025-05-07T20:32:44.6150528Z compiled=False, 2025-05-07T20:32:44.6150745Z ) 2025-05-07T20:32:44.6151063Z self = 2025-05-07T20:32:44.6151563Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.6151835Z 2025-05-07T20:32:44.6151925Z @given( 2025-05-07T20:32:44.6152163Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.6152489Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.6152800Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.6153136Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.6153464Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.6153755Z ) 2025-05-07T20:32:44.6154107Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.6154550Z def test_silu_mul_quant( 2025-05-07T20:32:44.6154799Z self, 2025-05-07T20:32:44.6155005Z T: int, 2025-05-07T20:32:44.6155203Z D: int, 2025-05-07T20:32:44.6155423Z scale_ub: Optional[float], 2025-05-07T20:32:44.6155694Z contiguous: bool, 2025-05-07T20:32:44.6155929Z compiled: bool, 2025-05-07T20:32:44.6156155Z ) -> None: 2025-05-07T20:32:44.6156372Z torch.manual_seed(2025) 2025-05-07T20:32:44.6156657Z 2025-05-07T20:32:44.6156935Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.6158995Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
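The CompilationError above is a distinct failure mode from the OOMs: Triton rejects the fp8e4nv element type (FP8 E4M3 in NVIDIA's encoding) because it requires a compute capability 8.9+ GPU, and on this runner's older device Triton offers only 'fp8e4b15' and 'fp8e5'. Tests exercising FP8 kernels are typically gated on device capability; a hedged sketch of such a guard (fp8_e4m3_supported is an illustrative helper, not an existing fixture):

    import unittest
    import torch

    def fp8_e4m3_supported() -> bool:
        # Triton's fp8e4nv maps to FP8 E4M3, which needs an sm_89+ (Ada/Hopper) GPU.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Applied to the failing test, e.g.:
    # @unittest.skipUnless(fp8_e4m3_supported(), "FP8 E4M3 unsupported on this GPU")

With such a guard the suite would report a skip on this hardware instead of a CompilationError deep inside the Triton autotuner.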
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.6160851Z 2025-05-07T20:32:44.6160970Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.6161185Z 2025-05-07T20:32:44.6161303Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.6161712Z self=, 2025-05-07T20:32:44.6162128Z T=128, 2025-05-07T20:32:44.6162321Z D=7168, 2025-05-07T20:32:44.6162516Z scale_ub=1200.0, 2025-05-07T20:32:44.6162743Z contiguous=True, 2025-05-07T20:32:44.6162976Z compiled=True, 2025-05-07T20:32:44.6163184Z ) 2025-05-07T20:32:44.6455339Z self = 2025-05-07T20:32:44.6456028Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.6456371Z 2025-05-07T20:32:44.6456679Z @given( 2025-05-07T20:32:44.6456926Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.6457248Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.6457551Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.6457884Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.6458269Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.6458558Z ) 2025-05-07T20:32:44.6458916Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.6459451Z def test_silu_mul_quant( 2025-05-07T20:32:44.6459697Z self, 2025-05-07T20:32:44.6459908Z T: int, 2025-05-07T20:32:44.6460119Z D: int, 2025-05-07T20:32:44.6460343Z scale_ub: Optional[float], 2025-05-07T20:32:44.6460627Z contiguous: bool, 2025-05-07T20:32:44.6460880Z compiled: bool, 2025-05-07T20:32:44.6461117Z ) -> None: 2025-05-07T20:32:44.6461335Z torch.manual_seed(2025) 2025-05-07T20:32:44.6461588Z 2025-05-07T20:32:44.6461946Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.6462297Z 2025-05-07T20:32:44.6462505Z x_sign = torch.sign(x) 2025-05-07T20:32:44.6462809Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.6463118Z x = x_sign * x_clamp 2025-05-07T20:32:44.6463369Z x0 = x[:, :D] 2025-05-07T20:32:44.6463595Z x1 = x[:, D:] 2025-05-07T20:32:44.6463812Z 2025-05-07T20:32:44.6464008Z if contiguous: 2025-05-07T20:32:44.6464252Z x0 = x0.contiguous() 2025-05-07T20:32:44.6464514Z x1 = x1.contiguous() 2025-05-07T20:32:44.6464766Z 2025-05-07T20:32:44.6464968Z if scale_ub is not None: 2025-05-07T20:32:44.6465241Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.6465585Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.6465908Z ) 2025-05-07T20:32:44.6466115Z else: 2025-05-07T20:32:44.6466337Z scale_ub_tensor = None 2025-05-07T20:32:44.6466604Z 2025-05-07T20:32:44.6466846Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.6467173Z op = silu_mul_quant 2025-05-07T20:32:44.6467428Z if compiled: 2025-05-07T20:32:44.6467684Z op = torch.compile(op) 2025-05-07T20:32:44.6467987Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.6468301Z 2025-05-07T20:32:44.6468598Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.6468769Z 2025-05-07T20:32:44.6468878Z moe/activation_test.py:117: 2025-05-07T20:32:44.6469173Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.6469513Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.6469800Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.6470357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.6470929Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.6471601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.6472293Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.6472832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.6473515Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.6474185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.6474723Z kernel = self.compile( 2025-05-07T20:32:44.6475265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.6475924Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.6476374Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.6476606Z 2025-05-07T20:32:44.6476812Z self = 2025-05-07T20:32:44.6477893Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.6479285Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5999d5bf60>} 2025-05-07T20:32:44.6480676Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.6481705Z context = 2025-05-07T20:32:44.6481995Z 2025-05-07T20:32:44.6482208Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.6482735Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.6483206Z module_map=module_map) 2025-05-07T20:32:44.6483576Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.6483930Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.6484201Z E ^ 2025-05-07T20:32:44.6484669Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.6485122Z 2025-05-07T20:32:44.6485541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.6486059Z 2025-05-07T20:32:44.6486166Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.6486586Z self=, 2025-05-07T20:32:44.6486997Z T=128, 2025-05-07T20:32:44.6487187Z D=7168, 2025-05-07T20:32:44.6487390Z scale_ub=1200.0, 2025-05-07T20:32:44.6487770Z contiguous=True, 2025-05-07T20:32:44.6487990Z compiled=False, 2025-05-07T20:32:44.6488201Z ) 2025-05-07T20:32:44.6488522Z self = 2025-05-07T20:32:44.6489004Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.6489328Z 2025-05-07T20:32:44.6489405Z @given( 2025-05-07T20:32:44.6489642Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.6489948Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.6490257Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.6490585Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.6490913Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.6491193Z ) 2025-05-07T20:32:44.6491549Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.6491987Z def test_silu_mul_quant( 2025-05-07T20:32:44.6492226Z self, 2025-05-07T20:32:44.6492429Z T: int, 2025-05-07T20:32:44.6492630Z D: int, 2025-05-07T20:32:44.6492842Z scale_ub: Optional[float], 2025-05-07T20:32:44.6493117Z contiguous: bool, 2025-05-07T20:32:44.6493354Z compiled: bool, 2025-05-07T20:32:44.6493570Z ) -> None: 2025-05-07T20:32:44.6493794Z torch.manual_seed(2025) 2025-05-07T20:32:44.6494046Z 2025-05-07T20:32:44.6494321Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.6494671Z 2025-05-07T20:32:44.6494873Z x_sign = torch.sign(x) 2025-05-07T20:32:44.6495172Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.6497257Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.6499161Z 2025-05-07T20:32:44.6499285Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.6499553Z 2025-05-07T20:32:44.6499662Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.6500088Z self=, 2025-05-07T20:32:44.6500490Z T=128, 2025-05-07T20:32:44.6500687Z D=5120, 2025-05-07T20:32:44.6500887Z scale_ub=1200.0, 2025-05-07T20:32:44.6501115Z contiguous=True, 2025-05-07T20:32:44.6501336Z compiled=True, 2025-05-07T20:32:44.6501540Z ) 2025-05-07T20:32:44.6501908Z self = 2025-05-07T20:32:44.6502398Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.6502670Z 2025-05-07T20:32:44.6502752Z @given( 2025-05-07T20:32:44.6502989Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.6503303Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.6503616Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.6503950Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.6504279Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.6504566Z ) 2025-05-07T20:32:44.6504920Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.6505369Z def test_silu_mul_quant( 2025-05-07T20:32:44.6505949Z self, 2025-05-07T20:32:44.6506163Z T: int, 2025-05-07T20:32:44.6506369Z D: int, 2025-05-07T20:32:44.6506590Z scale_ub: Optional[float], 2025-05-07T20:32:44.6506877Z contiguous: bool, 2025-05-07T20:32:44.6507127Z compiled: bool, 2025-05-07T20:32:44.6507349Z ) -> None: 2025-05-07T20:32:44.6507572Z torch.manual_seed(2025) 2025-05-07T20:32:44.6507827Z 2025-05-07T20:32:44.6508102Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.6508448Z 2025-05-07T20:32:44.6508652Z x_sign = torch.sign(x) 2025-05-07T20:32:44.6509053Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.6511051Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.6512906Z 2025-05-07T20:32:44.6513030Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.6513247Z 2025-05-07T20:32:44.6513353Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.6513768Z self=, 2025-05-07T20:32:44.6514170Z T=128, 2025-05-07T20:32:44.6514370Z D=7168, 2025-05-07T20:32:44.6514567Z scale_ub=None, 2025-05-07T20:32:44.6514783Z contiguous=True, 2025-05-07T20:32:44.6515007Z compiled=True, 2025-05-07T20:32:44.6515215Z ) 2025-05-07T20:32:44.8441632Z self = 2025-05-07T20:32:44.8442145Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.8442450Z 2025-05-07T20:32:44.8442559Z @given( 2025-05-07T20:32:44.8443171Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8443507Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8443818Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8444159Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8444499Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8444791Z ) 2025-05-07T20:32:44.8445150Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8445611Z def test_silu_mul_quant( 2025-05-07T20:32:44.8445940Z self, 2025-05-07T20:32:44.8446145Z T: int, 2025-05-07T20:32:44.8446354Z D: int, 2025-05-07T20:32:44.8446577Z scale_ub: Optional[float], 2025-05-07T20:32:44.8446862Z contiguous: bool, 2025-05-07T20:32:44.8447118Z compiled: bool, 2025-05-07T20:32:44.8447349Z ) -> None: 2025-05-07T20:32:44.8447667Z torch.manual_seed(2025) 2025-05-07T20:32:44.8447914Z 2025-05-07T20:32:44.8448268Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8450332Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
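By this point in the run even 20 MiB requests fail with only 4.44 MiB free and 21.77 GiB allocated, which again points at memory accumulating across examples rather than any single oversized input. When diagnosing this kind of state, PyTorch's allocator introspection is useful (a debugging sketch, not part of the suite):

    import torch

    # Live tensor bytes vs. bytes the caching allocator is holding.
    print(f"{torch.cuda.memory_allocated() / 2**30:.2f} GiB allocated")
    print(f"{torch.cuda.memory_reserved() / 2**30:.2f} GiB reserved")
    # Full per-pool breakdown, including fragmentation statistics.
    print(torch.cuda.memory_summary())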
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.8452215Z 2025-05-07T20:32:44.8452338Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.8452559Z 2025-05-07T20:32:44.8461465Z FAILED 2025-05-07T20:32:44.8461607Z 2025-05-07T20:32:44.8461777Z =================================== FAILURES =================================== 2025-05-07T20:32:44.8462371Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:44.8463000Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:44.8463842Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 57, in testPartExecutor 2025-05-07T20:32:44.8464598Z | yield 2025-05-07T20:32:44.8465181Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 623, in run 2025-05-07T20:32:44.8466034Z | self._callTestMethod(testMethod) 2025-05-07T20:32:44.8466804Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 579, in _callTestMethod 2025-05-07T20:32:44.8467558Z | if method() is not None: 2025-05-07T20:32:44.8467889Z | ^^^^^^^^ 2025-05-07T20:32:44.8468833Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:44.8469849Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8470262Z | ^^^^^^^ 2025-05-07T20:32:44.8471021Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:44.8471874Z | raise the_error_hypothesis_found 2025-05-07T20:32:44.8472327Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:44.8472750Z +-+---------------- 1 ---------------- 2025-05-07T20:32:44.8473064Z | Traceback (most recent call last): 2025-05-07T20:32:44.8474035Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:44.8475136Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8475990Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8478832Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.8481614Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:44.8482211Z | self=, 2025-05-07T20:32:44.8482782Z | T=2048, 2025-05-07T20:32:44.8483113Z | D=5120, # or any other generated value 2025-05-07T20:32:44.8483572Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:44.8484081Z | contiguous=True, # or any other generated value 2025-05-07T20:32:44.8484655Z | compiled=False, # or any other generated value 2025-05-07T20:32:44.8485084Z | ) 2025-05-07T20:32:44.8485331Z | 2025-05-07T20:32:44.8486051Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:44.8486899Z +---------------- 2 ---------------- 2025-05-07T20:32:44.8487296Z | Traceback (most recent call last): 2025-05-07T20:32:44.8488468Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:44.8489572Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8490088Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8492830Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.8495609Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:44.8496245Z | self=, 2025-05-07T20:32:44.8496808Z | T=128, 2025-05-07T20:32:44.8497095Z | D=7168, 2025-05-07T20:32:44.8497385Z | scale_ub=None, 2025-05-07T20:32:44.8497743Z | contiguous=True, 2025-05-07T20:32:44.8498079Z | compiled=True, 2025-05-07T20:32:44.8498388Z | ) 2025-05-07T20:32:44.8498643Z | 2025-05-07T20:32:44.8499384Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:44.8500215Z +---------------- 3 ---------------- 2025-05-07T20:32:44.8500621Z | Traceback (most recent call last): 2025-05-07T20:32:44.8501644Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:44.8502726Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8503230Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8505327Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
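Each distinct failure above ships with a @reproduce_failure payload; applying one temporarily pins Hypothesis to that exact example, as the report itself instructs. A sketch for the first failure (the version string and blob must be copied verbatim from the output above):

    from hypothesis import reproduce_failure

    # @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
    # def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
    #     ...

The decorator should be removed again once the example passes, since it disables normal example generation.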
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.8507587Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:44.8527076Z | self=, 2025-05-07T20:32:44.8527844Z | T=128, 2025-05-07T20:32:44.8528207Z | D=5120, 2025-05-07T20:32:44.8528716Z | scale_ub=1200.0, 2025-05-07T20:32:44.8529073Z | contiguous=True, 2025-05-07T20:32:44.8529414Z | compiled=True, 2025-05-07T20:32:44.8529740Z | ) 2025-05-07T20:32:44.8530004Z | 2025-05-07T20:32:44.8530750Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:44.8531619Z +---------------- 4 ---------------- 2025-05-07T20:32:44.8532137Z | Traceback (most recent call last): 2025-05-07T20:32:44.8533150Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:44.8534146Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:44.8534551Z | ^^^^^^^^ 2025-05-07T20:32:44.8535454Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:44.8536446Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.8536938Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8538078Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:44.8539200Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.8540049Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 330, in <lambda> 2025-05-07T20:32:44.8541074Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8541694Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8542582Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:44.8543813Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.8544481Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8545417Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in <dictcomp> 2025-05-07T20:32:44.8546548Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.8547172Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8548097Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:44.8549111Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.8549647Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8550493Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:44.8551285Z | fn() 2025-05-07T20:32:44.8552085Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:44.8552976Z | self.fn.run( 2025-05-07T20:32:44.8553813Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:44.8554633Z | kernel = self.compile( 2025-05-07T20:32:44.8555006Z | ^^^^^^^^^^^^^ 2025-05-07T20:32:44.8555810Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:44.8556762Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8557305Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8558249Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:44.8559354Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8560033Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8560660Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8561155Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.8561510Z | ^ 2025-05-07T20:32:44.8562160Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8562952Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:44.8563469Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:44.8564148Z | self=, 2025-05-07T20:32:44.8564719Z | T=1, # or any other generated value 2025-05-07T20:32:44.8565130Z | D=5120, # or any other generated value 2025-05-07T20:32:44.8565571Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:44.8566049Z | contiguous=True, # or any other generated value 2025-05-07T20:32:44.8566541Z | compiled=True, # or any other generated value 2025-05-07T20:32:44.8566942Z | ) 2025-05-07T20:32:44.8567192Z | 2025-05-07T20:32:44.8568019Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:44.8568824Z +------------------------------------ 2025-05-07T20:32:44.8569308Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:44.8569872Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8570416Z self=, 2025-05-07T20:32:44.8570949Z T=1, 2025-05-07T20:32:44.8571203Z D=5120, 2025-05-07T20:32:44.8571464Z scale_ub=None, 2025-05-07T20:32:44.8571746Z contiguous=True, 2025-05-07T20:32:44.8572054Z compiled=True, 2025-05-07T20:32:44.8572325Z ) 2025-05-07T20:32:44.8572734Z self = 2025-05-07T20:32:44.8573365Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.8573707Z 2025-05-07T20:32:44.8573823Z @given( 2025-05-07T20:32:44.8574111Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8574520Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8574920Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8575352Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8575782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8576171Z ) 2025-05-07T20:32:44.8576646Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8577237Z def test_silu_mul_quant( 2025-05-07T20:32:44.8577563Z self, 2025-05-07T20:32:44.8577828Z T: int, 2025-05-07T20:32:44.8578135Z D: int, 2025-05-07T20:32:44.8578430Z scale_ub: Optional[float], 2025-05-07T20:32:44.8578790Z contiguous: 
bool, 2025-05-07T20:32:44.8579175Z compiled: bool, 2025-05-07T20:32:44.8579484Z ) -> None: 2025-05-07T20:32:44.8579775Z torch.manual_seed(2025) 2025-05-07T20:32:44.8580094Z 2025-05-07T20:32:44.8580456Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8580913Z 2025-05-07T20:32:44.8581166Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8581552Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8581986Z x = x_sign * x_clamp 2025-05-07T20:32:44.8582318Z x0 = x[:, :D] 2025-05-07T20:32:44.8582674Z x1 = x[:, D:] 2025-05-07T20:32:44.8582962Z 2025-05-07T20:32:44.8583231Z if contiguous: 2025-05-07T20:32:44.8583548Z x0 = x0.contiguous() 2025-05-07T20:32:44.8583932Z x1 = x1.contiguous() 2025-05-07T20:32:44.8584289Z 2025-05-07T20:32:44.8584565Z if scale_ub is not None: 2025-05-07T20:32:44.8584938Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8585440Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8585849Z ) 2025-05-07T20:32:44.8586111Z else: 2025-05-07T20:32:44.8586399Z scale_ub_tensor = None 2025-05-07T20:32:44.8586731Z 2025-05-07T20:32:44.8587036Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8587459Z op = silu_mul_quant 2025-05-07T20:32:44.8587799Z if compiled: 2025-05-07T20:32:44.8588121Z op = torch.compile(op) 2025-05-07T20:32:44.8588519Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8588891Z 2025-05-07T20:32:44.8589139Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.8589525Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.8589919Z 2025-05-07T20:32:44.8590229Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8590675Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.8591084Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.8591497Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.8591994Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.8592432Z 2025-05-07T20:32:44.8592692Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:44.8592968Z 2025-05-07T20:32:44.8593110Z moe/activation_test.py:126: 2025-05-07T20:32:44.8593528Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8594069Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:44.8594521Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.8595608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:44.8596641Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.8597396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.8598328Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8599259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:44.8600250Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.8601288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:44.8602337Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.8603370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:44.8604271Z return 
self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.8605160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:44.8606173Z fn() 2025-05-07T20:32:44.8606867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:44.8607743Z self.fn.run( 2025-05-07T20:32:44.8608392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.8609112Z kernel = self.compile( 2025-05-07T20:32:44.8609856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.8610857Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8611393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8611714Z 2025-05-07T20:32:44.8611987Z self = 2025-05-07T20:32:44.8613545Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.8615454Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5df65e13a0>} 2025-05-07T20:32:44.8617300Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.8618818Z context = 2025-05-07T20:32:44.8619234Z 2025-05-07T20:32:44.8619469Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.8620209Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8620865Z module_map=module_map) 2025-05-07T20:32:44.8621362Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8621847Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.8622219Z E ^ 2025-05-07T20:32:44.8622838Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8623451Z 2025-05-07T20:32:44.8624010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.8624819Z 2025-05-07T20:32:44.8624965Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8625526Z self=, 2025-05-07T20:32:44.8626047Z T=2048, 2025-05-07T20:32:44.8626313Z D=5120, 2025-05-07T20:32:44.8626578Z scale_ub=1200.0, 2025-05-07T20:32:44.8626876Z contiguous=True, 2025-05-07T20:32:44.8627194Z compiled=False, 2025-05-07T20:32:44.8627490Z ) 2025-05-07T20:32:44.8627915Z self = 2025-05-07T20:32:44.8628659Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.8629040Z 2025-05-07T20:32:44.8629166Z @given( 2025-05-07T20:32:44.8629488Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8629929Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8630347Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8630807Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8631243Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8631636Z ) 2025-05-07T20:32:44.8632124Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8632718Z def test_silu_mul_quant( 2025-05-07T20:32:44.8633037Z self, 2025-05-07T20:32:44.8633297Z T: int, 2025-05-07T20:32:44.8633555Z D: int, 2025-05-07T20:32:44.8633939Z scale_ub: Optional[float], 2025-05-07T20:32:44.8634306Z contiguous: bool, 2025-05-07T20:32:44.8634634Z compiled: bool, 2025-05-07T20:32:44.8634945Z ) -> None: 2025-05-07T20:32:44.8635237Z torch.manual_seed(2025) 2025-05-07T20:32:44.8635574Z 2025-05-07T20:32:44.8635962Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8636450Z 2025-05-07T20:32:44.8636726Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8637138Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8637644Z x = x_sign * x_clamp 2025-05-07T20:32:44.8637991Z x0 = x[:, :D] 2025-05-07T20:32:44.8638341Z x1 = x[:, D:] 2025-05-07T20:32:44.8638632Z 2025-05-07T20:32:44.8638898Z if contiguous: 2025-05-07T20:32:44.8639212Z x0 = x0.contiguous() 2025-05-07T20:32:44.8639580Z x1 = x1.contiguous() 2025-05-07T20:32:44.8639920Z 2025-05-07T20:32:44.8640191Z if scale_ub is not None: 2025-05-07T20:32:44.8640655Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8641107Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8641509Z ) 2025-05-07T20:32:44.8641773Z else: 2025-05-07T20:32:44.8642056Z scale_ub_tensor = None 2025-05-07T20:32:44.8642391Z 2025-05-07T20:32:44.8642691Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8643122Z op = silu_mul_quant 2025-05-07T20:32:44.8643470Z if compiled: 2025-05-07T20:32:44.8643809Z op = torch.compile(op) 2025-05-07T20:32:44.8644194Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8644570Z 2025-05-07T20:32:44.8644837Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.8645074Z 2025-05-07T20:32:44.8645216Z moe/activation_test.py:117: 2025-05-07T20:32:44.8645628Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8646085Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.8646471Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8647398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.8648418Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.8649123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.8650115Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8651006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.8651717Z kernel = self.compile( 2025-05-07T20:32:44.8652452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.8653351Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8653907Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8654230Z 2025-05-07T20:32:44.8654513Z self = 2025-05-07T20:32:44.8655939Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.8657331Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5df62902c0>} 2025-05-07T20:32:44.8658674Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.8659769Z context = 2025-05-07T20:32:44.8660116Z 2025-05-07T20:32:44.8660303Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.8660923Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8661480Z module_map=module_map) 2025-05-07T20:32:44.8661886Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8662294Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.8662629Z E ^ 2025-05-07T20:32:44.8663172Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8663734Z 2025-05-07T20:32:44.8664241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.8664882Z 2025-05-07T20:32:44.8664992Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8665518Z self=, 2025-05-07T20:32:44.8665919Z T=2048, 2025-05-07T20:32:44.8666116Z D=5120, 2025-05-07T20:32:44.8666313Z scale_ub=1200.0, 2025-05-07T20:32:44.8666537Z contiguous=True, 2025-05-07T20:32:44.8666758Z compiled=True, 2025-05-07T20:32:44.8666976Z ) 2025-05-07T20:32:44.8667299Z self = 2025-05-07T20:32:44.8667791Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.8668069Z 2025-05-07T20:32:44.8668150Z @given( 2025-05-07T20:32:44.8668383Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8668690Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8668999Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8669329Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8669651Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8669946Z ) 2025-05-07T20:32:44.8670194Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8670297Z def test_silu_mul_quant( 2025-05-07T20:32:44.8670378Z self, 2025-05-07T20:32:44.8670464Z T: int, 2025-05-07T20:32:44.8670544Z D: int, 2025-05-07T20:32:44.8670647Z scale_ub: Optional[float], 2025-05-07T20:32:44.8670743Z contiguous: bool, 2025-05-07T20:32:44.8670883Z compiled: bool, 2025-05-07T20:32:44.8670965Z ) -> None: 2025-05-07T20:32:44.8671071Z torch.manual_seed(2025) 2025-05-07T20:32:44.8671146Z 2025-05-07T20:32:44.8671318Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8671400Z 2025-05-07T20:32:44.8671497Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8671625Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8671724Z x = x_sign * x_clamp 2025-05-07T20:32:44.8671810Z x0 = x[:, :D] 2025-05-07T20:32:44.8671901Z x1 = x[:, D:] 2025-05-07T20:32:44.8671980Z 2025-05-07T20:32:44.8672067Z if contiguous: 2025-05-07T20:32:44.8672171Z x0 = x0.contiguous() 2025-05-07T20:32:44.8672265Z x1 = x1.contiguous() 2025-05-07T20:32:44.8672340Z 2025-05-07T20:32:44.8672440Z if scale_ub is not None: 2025-05-07T20:32:44.8672548Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8672685Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8672775Z ) 2025-05-07T20:32:44.8672857Z else: 2025-05-07T20:32:44.8672953Z scale_ub_tensor = None 2025-05-07T20:32:44.8673039Z 2025-05-07T20:32:44.8673169Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8673267Z op = silu_mul_quant 2025-05-07T20:32:44.8673355Z if compiled: 2025-05-07T20:32:44.8673457Z op = torch.compile(op) 2025-05-07T20:32:44.8673616Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8673693Z 2025-05-07T20:32:44.8673782Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.8673908Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.8673979Z 2025-05-07T20:32:44.8674114Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8674224Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.8674322Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.8674447Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.8674643Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.8674716Z 2025-05-07T20:32:44.8674826Z > 
y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:44.8674831Z 
2025-05-07T20:32:44.8674926Z moe/activation_test.py:126: 
2025-05-07T20:32:44.8675054Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:44.8675167Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:44.8675341Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:44.8675896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:44.8676005Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:44.8676363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:44.8676591Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:44.8676955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:44.8677211Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:44.8677610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:44.8677867Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:44.8678289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:44.8678454Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:44.8678793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:44.8678926Z     fn()
2025-05-07T20:32:44.8679319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:44.8679410Z     self.fn.run(
2025-05-07T20:32:44.8679744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:44.8679836Z     kernel = self.compile(
2025-05-07T20:32:44.8680220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:44.8680398Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:44.8680529Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:44.8680534Z 
2025-05-07T20:32:44.8680739Z self = <...>
2025-05-07T20:32:44.8681510Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:44.8682022Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f5df6291440>}
2025-05-07T20:32:44.8682805Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:44.8683003Z context = <...>
2025-05-07T20:32:44.8683007Z 
2025-05-07T20:32:44.8683170Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:44.8683436Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:44.8683542Z                            module_map=module_map)
2025-05-07T20:32:44.8683708Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8683858Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:44.8683936Z E   ^
2025-05-07T20:32:44.8684289Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8684293Z 
2025-05-07T20:32:44.8684710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.8684715Z 
2025-05-07T20:32:44.8684857Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.8685083Z     self=<...>,
2025-05-07T20:32:44.8685161Z     T=16384,
2025-05-07T20:32:44.8685236Z     D=7168,
2025-05-07T20:32:44.8685325Z     scale_ub=1200.0,
2025-05-07T20:32:44.8685408Z     contiguous=False,
2025-05-07T20:32:44.8685491Z     compiled=False,
2025-05-07T20:32:44.8685570Z )
2025-05-07T20:32:44.8685791Z self = <...>
2025-05-07T20:32:44.8685973Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:44.8685987Z 
2025-05-07T20:32:44.8686065Z     @given(
2025-05-07T20:32:44.8686182Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:44.8686288Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:44.8686402Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:44.8686521Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:44.8686641Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:44.8686712Z     )
2025-05-07T20:32:44.8686956Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:44.8687055Z     def test_silu_mul_quant(
2025-05-07T20:32:44.8687132Z         self,
2025-05-07T20:32:44.8687209Z         T: int,
2025-05-07T20:32:44.8687291Z         D: int,
2025-05-07T20:32:44.8687435Z         scale_ub: Optional[float],
2025-05-07T20:32:44.8687530Z         contiguous: bool,
2025-05-07T20:32:44.8687706Z         compiled: bool,
2025-05-07T20:32:44.8687787Z     ) -> None:
2025-05-07T20:32:44.8687892Z         torch.manual_seed(2025)
2025-05-07T20:32:44.8687969Z 
2025-05-07T20:32:44.8688144Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:44.8688228Z 
2025-05-07T20:32:44.8688325Z         x_sign = torch.sign(x)
2025-05-07T20:32:44.8688457Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:44.8688565Z         x = x_sign * x_clamp
2025-05-07T20:32:44.8688654Z         x0 = x[:, :D]
2025-05-07T20:32:44.8688737Z         x1 = x[:, D:]
2025-05-07T20:32:44.8688819Z 
2025-05-07T20:32:44.8688906Z         if contiguous:
2025-05-07T20:32:44.8689010Z             x0 = x0.contiguous()
2025-05-07T20:32:44.8689102Z             x1 = x1.contiguous()
2025-05-07T20:32:44.8689180Z 
2025-05-07T20:32:44.8689280Z         if scale_ub is not None:
2025-05-07T20:32:44.8689393Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:44.8689533Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:44.8689616Z             )
2025-05-07T20:32:44.8689695Z         else:
2025-05-07T20:32:44.8689794Z             scale_ub_tensor = None
2025-05-07T20:32:44.8689881Z 
2025-05-07T20:32:44.8690012Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:44.8690105Z             op = silu_mul_quant
2025-05-07T20:32:44.8690252Z             if compiled:
2025-05-07T20:32:44.8690359Z                 op = torch.compile(op)
2025-05-07T20:32:44.8690477Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:44.8690554Z 
2025-05-07T20:32:44.8690647Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:44.8690652Z 
2025-05-07T20:32:44.8690759Z moe/activation_test.py:117: 
2025-05-07T20:32:44.8690890Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:44.8690996Z moe/activation_test.py:115: in fn
2025-05-07T20:32:44.8691108Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:44.8691676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:44.8691775Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:44.8692140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:44.8692403Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:44.8692752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:44.8692849Z     kernel = self.compile(
2025-05-07T20:32:44.8693229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:44.8693411Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:44.8693542Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:44.8693550Z 
2025-05-07T20:32:44.8693761Z self = <...>
2025-05-07T20:32:44.8694532Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:44.8695039Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f5df50ba980>}
2025-05-07T20:32:44.8695790Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:44.8695980Z context = <...>
2025-05-07T20:32:44.8696030Z 
2025-05-07T20:32:44.8696204Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:44.8696466Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:44.8696577Z                            module_map=module_map)
2025-05-07T20:32:44.8696744Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8696847Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:44.8696932Z E   ^
2025-05-07T20:32:44.8697290Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8697295Z 
2025-05-07T20:32:44.8697709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
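Both tracebacks above (and every retry below) die in Triton's make_ir with the same root cause: the kernels cast to fp8e4nv (FP8 E4M3), which Triton only supports on NVIDIA GPUs with compute capability 8.9 or newer, while the A10G on this g5 runner reports (8, 6) and only offers fp8e4b15 and fp8e5. A minimal sketch of a capability guard that would skip these cases on such hardware; the helper name and decorator placement here are illustrative, not FBGEMM's actual test code:

    # Hypothetical guard, assuming only that torch is importable; not taken
    # from the FBGEMM test suite.
    import unittest

    import torch


    def _supports_fp8_e4m3() -> bool:
        # Triton's fp8e4nv maps to CUDA FP8 E4M3, which needs SM 8.9+
        # (Ada/Hopper). The A10G on this runner reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipIf(not _supports_fp8_e4m3(), "FP8 E4M3 requires SM 8.9+")
    class ActivationTests(unittest.TestCase):
        ...  # test_silu_mul_quant would live here

With such a guard the whole class is reported as skipped on pre-SM89 runners instead of failing once per Hypothesis example.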
2025-05-07T20:32:44.8697714Z 
2025-05-07T20:32:44.8697826Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.8698052Z     self=<...>,
2025-05-07T20:32:44.8698142Z     T=1,
2025-05-07T20:32:44.8698223Z     D=7168,
2025-05-07T20:32:44.8698308Z     scale_ub=None,
2025-05-07T20:32:44.8698405Z     contiguous=True,
2025-05-07T20:32:44.8698493Z     compiled=True,
2025-05-07T20:32:44.8698568Z )
2025-05-07T20:32:44.8704665Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:44.8704670Z 
2025-05-07T20:32:44.8704773Z moe/activation_test.py:126: 
2025-05-07T20:32:44.8722429Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8722539Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:44.8722622Z E   ^
2025-05-07T20:32:44.8722988Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8722993Z 
2025-05-07T20:32:44.8723411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.8723418Z 
2025-05-07T20:32:44.8723535Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.8723768Z     self=<...>,
2025-05-07T20:32:44.8723851Z     T=4096,
2025-05-07T20:32:44.8723939Z     D=5120,
2025-05-07T20:32:44.8724027Z     scale_ub=None,
2025-05-07T20:32:44.8724119Z     contiguous=False,
2025-05-07T20:32:44.8724215Z     compiled=False,
2025-05-07T20:32:44.8724295Z )
2025-05-07T20:32:44.8729659Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:44.8729664Z 
2025-05-07T20:32:44.8729768Z moe/activation_test.py:117: 
2025-05-07T20:32:44.8735866Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8735979Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:44.8736062Z E   ^
2025-05-07T20:32:44.8736426Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8736434Z 
2025-05-07T20:32:44.8736857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.8736862Z 
2025-05-07T20:32:44.8736972Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.8737209Z     self=<...>,
2025-05-07T20:32:44.8737295Z     T=4096,
2025-05-07T20:32:44.8737382Z     D=7168,
2025-05-07T20:32:44.8737476Z     scale_ub=None,
2025-05-07T20:32:44.8737569Z     contiguous=False,
2025-05-07T20:32:44.8737660Z     compiled=False,
2025-05-07T20:32:44.8737747Z )
2025-05-07T20:32:44.8742905Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:44.8742950Z 
2025-05-07T20:32:44.8743063Z moe/activation_test.py:117: 
2025-05-07T20:32:44.8749041Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8749146Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:44.8749229Z E   ^
2025-05-07T20:32:44.8749596Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8749600Z 
2025-05-07T20:32:44.8750062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.8750067Z 
2025-05-07T20:32:44.8750187Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.8750415Z     self=<...>,
2025-05-07T20:32:44.8750499Z     T=128,
2025-05-07T20:32:44.8750590Z     D=7168,
2025-05-07T20:32:44.8750681Z     scale_ub=None,
2025-05-07T20:32:44.8750777Z     contiguous=False,
2025-05-07T20:32:44.8750873Z     compiled=True,
2025-05-07T20:32:44.8750996Z )
2025-05-07T20:32:44.8757176Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:44.8757225Z 
2025-05-07T20:32:44.8757332Z moe/activation_test.py:126: 
2025-05-07T20:32:44.8766275Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8766382Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:44.8766464Z E   ^
2025-05-07T20:32:44.8766872Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8766877Z 
2025-05-07T20:32:44.8767293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
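For orientation, the pair under test computes the same thing two ways: silu_mul_quant fuses SiLU(x0) * x1 with row-wise FP8 quantization, and the test's ref_fn reproduces it as an fp32 SiLU-mul followed by triton_quantize_fp8_row, then compares after dequantizing with y_scale[:, None]. A rough plain-PyTorch sketch of that row-wise recipe; the amax/448 scaling and the clamp details are assumptions for illustration, not the exact kernel logic:

    # Reference sketch only; runs on any device with float8 support in
    # recent PyTorch. Not FBGEMM's triton_quantize_fp8_row implementation.
    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max


    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32, exactly as the test's ref_fn does.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # One scale per row: row amax (optionally capped by scale_ub)
        # divided by the fp8 e4m3 max, floored to avoid division by zero.
        row_amax = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_amax = torch.clamp(row_amax, max=scale_ub.item())
        scale = (row_amax / FP8_E4M3_MAX).clamp(min=1e-12)
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        # Dequantize with y_fp8.to(torch.float32) * scale[:, None].
        return y_fp8, scale

The final cast to torch.float8_e4m3fn is the reference counterpart of the fp8e4nv conversion that the Triton kernels attempt, which is exactly the step this GPU cannot compile.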
2025-05-07T20:32:44.8767298Z 
2025-05-07T20:32:44.8767414Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.8767694Z     self=<...>,
2025-05-07T20:32:44.8767780Z     T=128,
2025-05-07T20:32:44.8767869Z     D=7168,
2025-05-07T20:32:44.8768001Z     scale_ub=None,
2025-05-07T20:32:44.8768092Z     contiguous=False,
2025-05-07T20:32:44.8768187Z     compiled=False,
2025-05-07T20:32:44.8768266Z )
2025-05-07T20:32:44.8773216Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:44.8773224Z 
2025-05-07T20:32:44.8773324Z moe/activation_test.py:117: 
2025-05-07T20:32:44.8779342Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8779452Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:44.8779534Z E   ^
2025-05-07T20:32:44.8779889Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8779894Z 
2025-05-07T20:32:44.8780312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.8780361Z 
2025-05-07T20:32:44.8780469Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.8780697Z     self=<...>,
2025-05-07T20:32:44.8780777Z     T=4096,
2025-05-07T20:32:44.8780855Z     D=5120,
2025-05-07T20:32:44.8780948Z     scale_ub=1200.0,
2025-05-07T20:32:44.8781034Z     contiguous=True,
2025-05-07T20:32:44.8781123Z     compiled=False,
2025-05-07T20:32:44.8781207Z )
2025-05-07T20:32:44.8786220Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:44.8786225Z 
2025-05-07T20:32:44.8786330Z moe/activation_test.py:117: 
2025-05-07T20:32:44.8792256Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8792362Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:44.8792485Z E   ^
2025-05-07T20:32:44.8792848Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8792853Z 
2025-05-07T20:32:44.8793264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.8793269Z 
2025-05-07T20:32:44.8793382Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.8793666Z     self=<...>,
2025-05-07T20:32:44.8793750Z     T=1,
2025-05-07T20:32:44.8793841Z     D=5120,
2025-05-07T20:32:44.8793931Z     scale_ub=None,
2025-05-07T20:32:44.8794021Z     contiguous=True,
2025-05-07T20:32:44.8794114Z     compiled=True,
2025-05-07T20:32:44.8794194Z )
2025-05-07T20:32:44.8800177Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:44.8800182Z 
2025-05-07T20:32:44.8800284Z moe/activation_test.py:126: 
2025-05-07T20:32:44.8809660Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8809769Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:44.8809849Z E   ^
2025-05-07T20:32:44.8810206Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8810216Z 
2025-05-07T20:32:44.8810633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
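Note the pattern across examples: with compiled=False the failure surfaces directly inside silu_mul_quant (_fbgemm_silu_mul_quant), while with compiled=True fn() returns and the reference path fails instead (_kernel_quantize_fp8_row); both kernels die before any autotuning config can run. The failure is also reproducible without FBGEMM: any Triton kernel that touches tl.float8e4nv should raise the same CompilationError on this GPU. An untested minimal sketch (kernel and tensor names invented here, and the exact float8 pointer plumbing may vary by Triton version):

    # Standalone reproducer sketch; assumes a recent Triton and a PyTorch
    # build with float8 dtypes. On SM 8.6 GPUs such as the A10G this is
    # expected to raise "type fp8e4nv not supported in this architecture".
    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _fp8_cast_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # The cast below is the operation that needs fp8e4nv hardware support.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


    x = torch.randn(4096, device="cuda")
    y = torch.empty(4096, device="cuda", dtype=torch.float8_e4m3fn)
    _fp8_cast_kernel[(triton.cdiv(4096, 1024),)](x, y, 4096, BLOCK=1024)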
2025-05-07T20:32:44.8810695Z 
2025-05-07T20:32:44.8810804Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.8811034Z     self=<...>,
2025-05-07T20:32:44.8811116Z     T=2048,
2025-05-07T20:32:44.8811198Z     D=5120,
2025-05-07T20:32:44.8811293Z     scale_ub=None,
2025-05-07T20:32:44.8811380Z     contiguous=True,
2025-05-07T20:32:44.8811467Z     compiled=True,
2025-05-07T20:32:44.8811552Z )
2025-05-07T20:32:44.8817608Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:44.8817613Z 
2025-05-07T20:32:44.8817714Z moe/activation_test.py:126: 
2025-05-07T20:32:44.8826649Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8826756Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:44.8826837Z E   ^
2025-05-07T20:32:44.8827193Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8827198Z 
2025-05-07T20:32:44.8827659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.8827664Z 
2025-05-07T20:32:44.8827772Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.8828003Z     self=<...>,
2025-05-07T20:32:44.8828084Z     T=128,
2025-05-07T20:32:44.8828164Z     D=5120,
2025-05-07T20:32:44.8828255Z     scale_ub=None,
2025-05-07T20:32:44.8828342Z     contiguous=True,
2025-05-07T20:32:44.8828426Z     compiled=True,
2025-05-07T20:32:44.8828510Z )
2025-05-07T20:32:44.8834487Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:44.8834500Z 
2025-05-07T20:32:44.8834637Z moe/activation_test.py:126: 
2025-05-07T20:32:44.8843456Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8843559Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:44.8843645Z E   ^
2025-05-07T20:32:44.8844042Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8844047Z 
2025-05-07T20:32:44.8844460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.8841744Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ...<locals>.<lambda> at 0x7f5d3fc91e40>}
2025-05-07T20:32:44.8842490Z module_map = {'triton.language.extra.libdevice': <module 'triton.language.extra.libdevice' ...>}
2025-05-07T20:32:44.8842683Z context = <...>
2025-05-07T20:32:44.8842688Z 
2025-05-07T20:32:44.8842862Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:44.8843175Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:44.8843294Z                           module_map=module_map)
2025-05-07T20:32:44.8843456Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8843559Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:44.8843645Z E       ^
2025-05-07T20:32:44.8844042Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8844047Z 
2025-05-07T20:32:44.8844460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
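Every failure in this run has the same root cause: both kernels ask Triton for the fp8e4nv element type (PyTorch's float8_e4m3fn), and this runner's GPU cannot lower it. A g5.4xlarge carries an NVIDIA A10G at compute capability 8.6, while Triton generally lowers fp8e4nv only on compute capability 8.9 and newer (Ada/Hopper). A minimal sketch of a guard a test file like this could use, assuming the 8.9 threshold; the helper name is illustrative, not FBGEMM or test-suite API:

    import unittest
    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Assumption: Triton lowers fp8e4nv only on NVIDIA GPUs with
        # compute capability >= 8.9; the A10G in this job reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Illustrative usage on a test like the one above:
    # @unittest.skipIf(not gpu_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    # def test_silu_mul_quant(self, ...) -> None: ...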
Hypothesis goes on to retry the test with further examples. Every retry re-prints the identical test body shown above, fails while compiling the same Triton code, and ends in the same CompilationError at triton/compiler/compiler.py:100; only the drawn parameters and the failing call site vary. The next four retries:

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> fails in fn() at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True) -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
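The two failing call sites exercise the same contract from different directions: fn() runs the fused _fbgemm_silu_mul_quant kernel, while ref_fn() materializes y = x0 * sigmoid(x0) * x1 in fp32 and row-quantizes it with triton_quantize_fp8_row. Judging from how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]), the quantization produces one fp32 scale per row. A pure-PyTorch sketch of that row-wise scheme; the 448.0 constant is the float8_e4m3fn maximum, and treating scale_ub as a cap on the per-row maximum is an assumption about triton_quantize_fp8_row, not something this log confirms:

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_sketch(
        y: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
        eps: float = 1e-12,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, chosen so the row's max magnitude maps to the
        # largest representable float8_e4m3fn value (~448).
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=eps) / 448.0
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        # Dequantize as the test does: y_fp8.to(torch.float32) * scale[:, None]
        return y_fp8, scale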
at 0x7f5d3f2f9f80>} 2025-05-07T20:32:44.8920091Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.8920289Z context = 2025-05-07T20:32:44.8920294Z 2025-05-07T20:32:44.8920459Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.8920725Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8920848Z module_map=module_map) 2025-05-07T20:32:44.8921013Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8921117Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.8921201Z E ^ 2025-05-07T20:32:44.8921556Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8921561Z 2025-05-07T20:32:44.8922029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.8922034Z 2025-05-07T20:32:44.8922145Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8922368Z self=, 2025-05-07T20:32:44.8922452Z T=1, 2025-05-07T20:32:44.8922535Z D=5120, 2025-05-07T20:32:44.8922621Z scale_ub=None, 2025-05-07T20:32:44.8922717Z contiguous=True, 2025-05-07T20:32:44.8922804Z compiled=False, 2025-05-07T20:32:44.8922946Z ) 2025-05-07T20:32:44.8923173Z self = 2025-05-07T20:32:44.8923340Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.8923345Z 2025-05-07T20:32:44.8923433Z @given( 2025-05-07T20:32:44.8923555Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8923660Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8923830Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8923952Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8924067Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8924149Z ) 2025-05-07T20:32:44.8924396Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8924500Z def test_silu_mul_quant( 2025-05-07T20:32:44.8924580Z self, 2025-05-07T20:32:44.8924663Z T: int, 2025-05-07T20:32:44.8924751Z D: int, 2025-05-07T20:32:44.8924857Z scale_ub: Optional[float], 2025-05-07T20:32:44.8924948Z contiguous: bool, 2025-05-07T20:32:44.8925044Z compiled: bool, 2025-05-07T20:32:44.8925126Z ) -> None: 2025-05-07T20:32:44.8925224Z torch.manual_seed(2025) 2025-05-07T20:32:44.8925306Z 2025-05-07T20:32:44.8925476Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8925553Z 2025-05-07T20:32:44.8925660Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8925787Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8925885Z x = x_sign * x_clamp 2025-05-07T20:32:44.8925968Z x0 = x[:, :D] 2025-05-07T20:32:44.8926053Z x1 = x[:, D:] 2025-05-07T20:32:44.8926133Z 2025-05-07T20:32:44.8926220Z if contiguous: 2025-05-07T20:32:44.8926314Z x0 = x0.contiguous() 2025-05-07T20:32:44.8926413Z x1 = x1.contiguous() 2025-05-07T20:32:44.8926534Z 2025-05-07T20:32:44.8926631Z if scale_ub is not None: 2025-05-07T20:32:44.8926744Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8926882Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8926961Z ) 2025-05-07T20:32:44.8927046Z else: 2025-05-07T20:32:44.8927144Z scale_ub_tensor = None 2025-05-07T20:32:44.8927220Z 2025-05-07T20:32:44.8927364Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8927461Z op = silu_mul_quant 2025-05-07T20:32:44.8927623Z if compiled: 2025-05-07T20:32:44.8927723Z 
op = torch.compile(op) 2025-05-07T20:32:44.8927829Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8927908Z 2025-05-07T20:32:44.8928008Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.8928014Z 2025-05-07T20:32:44.8928127Z moe/activation_test.py:117: 2025-05-07T20:32:44.8928286Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8928396Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.8928502Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8929004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.8929103Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.8929508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.8929731Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8930072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.8930176Z kernel = self.compile( 2025-05-07T20:32:44.8930556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.8930738Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8930946Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8930951Z 2025-05-07T20:32:44.8931156Z self = 2025-05-07T20:32:44.8931981Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.8932483Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3f2fb9c0>} 2025-05-07T20:32:44.8933230Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.8933426Z context = 2025-05-07T20:32:44.8933431Z 2025-05-07T20:32:44.8933592Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.8933859Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8933968Z module_map=module_map) 2025-05-07T20:32:44.8934137Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8934240Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.8934321Z E ^ 2025-05-07T20:32:44.8934682Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8934686Z 2025-05-07T20:32:44.8935100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.8935146Z 2025-05-07T20:32:44.8935257Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8935479Z self=, 2025-05-07T20:32:44.8935556Z T=128, 2025-05-07T20:32:44.8935639Z D=5120, 2025-05-07T20:32:44.8935727Z scale_ub=None, 2025-05-07T20:32:44.8935814Z contiguous=False, 2025-05-07T20:32:44.8935904Z compiled=True, 2025-05-07T20:32:44.8935979Z ) 2025-05-07T20:32:44.8936196Z self = 2025-05-07T20:32:44.8936375Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.8936380Z 2025-05-07T20:32:44.8936459Z @given( 2025-05-07T20:32:44.8936583Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8936683Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8936798Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8936920Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8937035Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8937114Z ) 2025-05-07T20:32:44.8937360Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8937455Z def test_silu_mul_quant( 2025-05-07T20:32:44.8937534Z self, 2025-05-07T20:32:44.8937617Z T: int, 2025-05-07T20:32:44.8937696Z D: int, 2025-05-07T20:32:44.8937796Z scale_ub: Optional[float], 2025-05-07T20:32:44.8937934Z contiguous: bool, 2025-05-07T20:32:44.8938024Z compiled: bool, 2025-05-07T20:32:44.8938107Z ) -> None: 2025-05-07T20:32:44.8938203Z torch.manual_seed(2025) 2025-05-07T20:32:44.8938279Z 2025-05-07T20:32:44.8938453Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8938527Z 2025-05-07T20:32:44.8938618Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8938746Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8938839Z x = x_sign * x_clamp 2025-05-07T20:32:44.8938963Z x0 = x[:, :D] 2025-05-07T20:32:44.8939052Z x1 = x[:, D:] 2025-05-07T20:32:44.8939124Z 2025-05-07T20:32:44.8939209Z if contiguous: 2025-05-07T20:32:44.8939305Z x0 = x0.contiguous() 2025-05-07T20:32:44.8939393Z x1 = x1.contiguous() 2025-05-07T20:32:44.8939470Z 2025-05-07T20:32:44.8939562Z if scale_ub is not None: 2025-05-07T20:32:44.8939668Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8939850Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8939930Z ) 2025-05-07T20:32:44.8940006Z else: 2025-05-07T20:32:44.8940106Z scale_ub_tensor = None 2025-05-07T20:32:44.8940181Z 2025-05-07T20:32:44.8940310Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8940406Z op = silu_mul_quant 2025-05-07T20:32:44.8940491Z if compiled: 2025-05-07T20:32:44.8940593Z op = torch.compile(op) 2025-05-07T20:32:44.8940706Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8940780Z 2025-05-07T20:32:44.8940870Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.8940879Z 2025-05-07T20:32:44.8940978Z moe/activation_test.py:117: 2025-05-07T20:32:44.8941112Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8941218Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.8941319Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8941685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.8941782Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.8942273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.8942377Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.8942731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.8942999Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8943341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.8943437Z kernel = self.compile( 2025-05-07T20:32:44.8943819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.8943996Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8944124Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8944128Z 2025-05-07T20:32:44.8944338Z self = 2025-05-07T20:32:44.8945108Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.8945614Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3f0a1120>} 2025-05-07T20:32:44.8946403Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.8946595Z context = 2025-05-07T20:32:44.8946599Z 2025-05-07T20:32:44.8946766Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.8947027Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8947134Z module_map=module_map) 2025-05-07T20:32:44.8947302Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8947444Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.8947530Z E ^ 2025-05-07T20:32:44.8947881Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8947886Z 2025-05-07T20:32:44.8948331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.8948340Z 2025-05-07T20:32:44.8948501Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8948723Z self=, 2025-05-07T20:32:44.8948804Z T=128, 2025-05-07T20:32:44.8948885Z D=7168, 2025-05-07T20:32:44.8948970Z scale_ub=1200.0, 2025-05-07T20:32:44.8949059Z contiguous=False, 2025-05-07T20:32:44.8949147Z compiled=False, 2025-05-07T20:32:44.8949221Z ) 2025-05-07T20:32:44.8949443Z self = 2025-05-07T20:32:44.8949620Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.8949625Z 2025-05-07T20:32:44.8949704Z @given( 2025-05-07T20:32:44.8949828Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8949927Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8950046Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8950164Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8950278Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8950357Z ) 2025-05-07T20:32:44.8950600Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8950693Z def test_silu_mul_quant( 2025-05-07T20:32:44.8950774Z self, 2025-05-07T20:32:44.8950850Z T: int, 2025-05-07T20:32:44.8950927Z D: int, 2025-05-07T20:32:44.8951029Z scale_ub: Optional[float], 2025-05-07T20:32:44.8951164Z contiguous: bool, 2025-05-07T20:32:44.8951255Z compiled: bool, 2025-05-07T20:32:44.8951340Z ) -> None: 2025-05-07T20:32:44.8951435Z torch.manual_seed(2025) 2025-05-07T20:32:44.8951517Z 2025-05-07T20:32:44.8951686Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8951763Z 2025-05-07T20:32:44.8951856Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8951979Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8952072Z x = x_sign * x_clamp 2025-05-07T20:32:44.8952158Z x0 = x[:, :D] 2025-05-07T20:32:44.8952239Z x1 = x[:, D:] 2025-05-07T20:32:44.8952318Z 2025-05-07T20:32:44.8952403Z if contiguous: 2025-05-07T20:32:44.8952494Z x0 = x0.contiguous() 2025-05-07T20:32:44.8952589Z x1 = x1.contiguous() 2025-05-07T20:32:44.8952660Z 2025-05-07T20:32:44.8952749Z if scale_ub is not None: 2025-05-07T20:32:44.8952862Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8953000Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8953085Z ) 2025-05-07T20:32:44.8953162Z else: 2025-05-07T20:32:44.8953257Z scale_ub_tensor = None 2025-05-07T20:32:44.8953339Z 2025-05-07T20:32:44.8953466Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8953556Z op = silu_mul_quant 2025-05-07T20:32:44.8953646Z if compiled: 2025-05-07T20:32:44.8953816Z op = torch.compile(op) 2025-05-07T20:32:44.8953923Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8954000Z 2025-05-07T20:32:44.8954093Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.8954097Z 2025-05-07T20:32:44.8954192Z moe/activation_test.py:117: 2025-05-07T20:32:44.8954327Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8954428Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.8954536Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8955073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.8955171Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.8955531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.8955754Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8956131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.8956231Z kernel = self.compile( 2025-05-07T20:32:44.8956609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.8956784Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8956914Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8956921Z 2025-05-07T20:32:44.8957129Z self = 2025-05-07T20:32:44.8957902Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.8958453Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3f0a0360>} 2025-05-07T20:32:44.8959203Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.8959393Z context = 2025-05-07T20:32:44.8959438Z 2025-05-07T20:32:44.8959606Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.8959873Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8959982Z module_map=module_map) 2025-05-07T20:32:44.8960150Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8960251Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.8960331Z E ^ 2025-05-07T20:32:44.8960696Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8960700Z 2025-05-07T20:32:44.8961111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.8961116Z 2025-05-07T20:32:44.8961225Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8961446Z self=, 2025-05-07T20:32:44.8961531Z T=128, 2025-05-07T20:32:44.8961615Z D=5120, 2025-05-07T20:32:44.8961703Z scale_ub=None, 2025-05-07T20:32:44.8961788Z contiguous=False, 2025-05-07T20:32:44.8961878Z compiled=False, 2025-05-07T20:32:44.8961953Z ) 2025-05-07T20:32:44.8962168Z self = 2025-05-07T20:32:44.8962344Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.8962392Z 2025-05-07T20:32:44.8962478Z @given( 2025-05-07T20:32:44.8962601Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8962702Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8962817Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8962941Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8963055Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8963130Z ) 2025-05-07T20:32:44.8963377Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8963514Z def test_silu_mul_quant( 2025-05-07T20:32:44.8963597Z self, 2025-05-07T20:32:44.8963675Z T: int, 2025-05-07T20:32:44.8963748Z D: int, 2025-05-07T20:32:44.8963850Z scale_ub: Optional[float], 2025-05-07T20:32:44.8963941Z contiguous: bool, 2025-05-07T20:32:44.8964028Z compiled: bool, 2025-05-07T20:32:44.8964112Z ) -> None: 2025-05-07T20:32:44.8964210Z torch.manual_seed(2025) 2025-05-07T20:32:44.8964328Z 2025-05-07T20:32:44.8964504Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8964580Z 2025-05-07T20:32:44.8964672Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8964801Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8964890Z x = x_sign * x_clamp 2025-05-07T20:32:44.8964974Z x0 = x[:, :D] 2025-05-07T20:32:44.8965063Z x1 = x[:, D:] 2025-05-07T20:32:44.8965137Z 2025-05-07T20:32:44.8965228Z if contiguous: 2025-05-07T20:32:44.8965321Z x0 = x0.contiguous() 2025-05-07T20:32:44.8965413Z x1 = x1.contiguous() 2025-05-07T20:32:44.8965492Z 2025-05-07T20:32:44.8965583Z if scale_ub is not None: 2025-05-07T20:32:44.8965693Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8965834Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8965914Z ) 2025-05-07T20:32:44.8965997Z else: 2025-05-07T20:32:44.8966097Z scale_ub_tensor = None 2025-05-07T20:32:44.8966171Z 2025-05-07T20:32:44.8966302Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8966397Z op = silu_mul_quant 2025-05-07T20:32:44.8966484Z if compiled: 2025-05-07T20:32:44.8966590Z op = torch.compile(op) 2025-05-07T20:32:44.8966696Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8966814Z 2025-05-07T20:32:44.8966912Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.8966919Z 2025-05-07T20:32:44.8967016Z moe/activation_test.py:117: 2025-05-07T20:32:44.8967146Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8967252Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.8967351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8967898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.8968008Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.8968360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.8968582Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8968919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.8969020Z kernel = self.compile( 2025-05-07T20:32:44.8969406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.8969579Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8969710Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8969714Z 2025-05-07T20:32:44.8969965Z self = 2025-05-07T20:32:44.8970738Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.8971242Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3ec28720>} 2025-05-07T20:32:44.8971987Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.8972219Z context = 2025-05-07T20:32:44.8972224Z 2025-05-07T20:32:44.8972389Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.8972691Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8972805Z module_map=module_map) 2025-05-07T20:32:44.8972964Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8973066Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.8973145Z E ^ 2025-05-07T20:32:44.8973495Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8973503Z 2025-05-07T20:32:44.8973918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.8973926Z 2025-05-07T20:32:44.8974030Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8974254Z self=, 2025-05-07T20:32:44.8974334Z T=128, 2025-05-07T20:32:44.8974413Z D=5120, 2025-05-07T20:32:44.8974501Z scale_ub=1200.0, 2025-05-07T20:32:44.8974589Z contiguous=True, 2025-05-07T20:32:44.8974675Z compiled=False, 2025-05-07T20:32:44.8974756Z ) 2025-05-07T20:32:44.8974971Z self = 2025-05-07T20:32:44.8975138Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.8975143Z 2025-05-07T20:32:44.8975222Z @given( 2025-05-07T20:32:44.8975341Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8975490Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8975606Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8975723Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8975841Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8975916Z ) 2025-05-07T20:32:44.8976158Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8976257Z def test_silu_mul_quant( 2025-05-07T20:32:44.8976336Z self, 2025-05-07T20:32:44.8976415Z T: int, 2025-05-07T20:32:44.8976495Z D: int, 2025-05-07T20:32:44.8976592Z scale_ub: Optional[float], 2025-05-07T20:32:44.8976681Z contiguous: bool, 2025-05-07T20:32:44.8976772Z compiled: bool, 2025-05-07T20:32:44.8976848Z ) -> None: 2025-05-07T20:32:44.8976945Z torch.manual_seed(2025) 2025-05-07T20:32:44.8977020Z 2025-05-07T20:32:44.8977188Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8977280Z 2025-05-07T20:32:44.8977375Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8977500Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8977597Z x = x_sign * x_clamp 2025-05-07T20:32:44.8977679Z x0 = x[:, :D] 2025-05-07T20:32:44.8977760Z x1 = x[:, D:] 2025-05-07T20:32:44.8977837Z 2025-05-07T20:32:44.8977923Z if contiguous: 2025-05-07T20:32:44.8978014Z x0 = x0.contiguous() 2025-05-07T20:32:44.8978155Z x1 = x1.contiguous() 2025-05-07T20:32:44.8978232Z 2025-05-07T20:32:44.8978325Z if scale_ub is not None: 2025-05-07T20:32:44.8978436Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8978570Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8978649Z ) 2025-05-07T20:32:44.8978725Z else: 2025-05-07T20:32:44.8978818Z scale_ub_tensor = None 2025-05-07T20:32:44.8978897Z 2025-05-07T20:32:44.8979029Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8979168Z op = silu_mul_quant 2025-05-07T20:32:44.8983196Z if compiled: 2025-05-07T20:32:44.8983314Z op = torch.compile(op) 2025-05-07T20:32:44.8983427Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8983501Z 2025-05-07T20:32:44.8983594Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.8983599Z 2025-05-07T20:32:44.8983706Z moe/activation_test.py:117: 2025-05-07T20:32:44.8983913Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8984016Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.8984117Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8984629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.8984736Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.8985099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.8985325Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8985670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.8985765Z kernel = self.compile( 2025-05-07T20:32:44.8986151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.8986333Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8986461Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8986465Z 2025-05-07T20:32:44.8986678Z self = 2025-05-07T20:32:44.8987455Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.8988027Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3ec298a0>} 2025-05-07T20:32:44.8988782Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.8988975Z context = 2025-05-07T20:32:44.8988980Z 2025-05-07T20:32:44.8989146Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.8989406Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8989513Z module_map=module_map) 2025-05-07T20:32:44.8989679Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8989780Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.8989864Z E ^ 2025-05-07T20:32:44.8990218Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8990222Z 2025-05-07T20:32:44.8990631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.8990676Z 2025-05-07T20:32:44.8990792Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8991015Z self=, 2025-05-07T20:32:44.8991099Z T=1, 2025-05-07T20:32:44.8991179Z D=7168, 2025-05-07T20:32:44.8991262Z scale_ub=1200.0, 2025-05-07T20:32:44.8991351Z contiguous=True, 2025-05-07T20:32:44.8991434Z compiled=True, 2025-05-07T20:32:44.8991506Z ) 2025-05-07T20:32:44.8991728Z self = 2025-05-07T20:32:44.8991933Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.8991938Z 2025-05-07T20:32:44.8992015Z @given( 2025-05-07T20:32:44.8992143Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8992243Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8992362Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8992481Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8992629Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8992713Z ) 2025-05-07T20:32:44.8992953Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8993051Z def test_silu_mul_quant( 2025-05-07T20:32:44.8993137Z self, 2025-05-07T20:32:44.8993214Z T: int, 2025-05-07T20:32:44.8993290Z D: int, 2025-05-07T20:32:44.8993390Z scale_ub: Optional[float], 2025-05-07T20:32:44.8993482Z contiguous: bool, 2025-05-07T20:32:44.8993567Z compiled: bool, 2025-05-07T20:32:44.8993654Z ) -> None: 2025-05-07T20:32:44.8993749Z torch.manual_seed(2025) 2025-05-07T20:32:44.8993826Z 2025-05-07T20:32:44.8993991Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8994071Z 2025-05-07T20:32:44.8994161Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8994286Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8994377Z x = x_sign * x_clamp 2025-05-07T20:32:44.8994458Z x0 = x[:, :D] 2025-05-07T20:32:44.8994535Z x1 = x[:, D:] 2025-05-07T20:32:44.8994613Z 2025-05-07T20:32:44.8994696Z if contiguous: 2025-05-07T20:32:44.8994786Z x0 = x0.contiguous() 2025-05-07T20:32:44.8994877Z x1 = x1.contiguous() 2025-05-07T20:32:44.8994953Z 2025-05-07T20:32:44.8995043Z if scale_ub is not None: 2025-05-07T20:32:44.8995196Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8995332Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8995418Z ) 2025-05-07T20:32:44.8995494Z else: 2025-05-07T20:32:44.8995590Z scale_ub_tensor = None 2025-05-07T20:32:44.8995667Z 2025-05-07T20:32:44.8995796Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8995891Z op = silu_mul_quant 2025-05-07T20:32:44.8995982Z if compiled: 2025-05-07T20:32:44.8996088Z op = torch.compile(op) 2025-05-07T20:32:44.8996193Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8996269Z 2025-05-07T20:32:44.8996360Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.8996364Z 2025-05-07T20:32:44.8996459Z moe/activation_test.py:117: 2025-05-07T20:32:44.8996589Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8996689Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.8996796Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8997170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.8997264Z return fn(*args, **kwargs) 
Trying example: test_silu_mul_quant(
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[test body and traceback identical to the example above; fails in _fbgemm_silu_mul_quant with the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
Trying example: test_silu_mul_quant(
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)

T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    [test body as above; this run gets past fn() and instead fails in the reference path]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
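Note that this example fails in a different kernel, _kernel_quantize_fp8_row, reached through the test's ref_fn: the reference path computes SiLU(x0) * x1 in fp32 and then quantizes row-wise to FP8. For orientation, here is a rough eager-mode equivalent of that reference computation in plain PyTorch; it assumes torch.float8_e4m3fn is available and uses its maximum finite value of 448.0, and it is an illustrative sketch of the semantics, not FBGEMM's triton_quantize_fp8_row implementation (the clamping details may differ):

from typing import Optional, Tuple
import torch

FP8_MAX = 448.0  # max finite value of torch.float8_e4m3fn

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in fp32, as in the test's ref_fn.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    # One scale per row; scale_ub, when given, caps the per-row max.
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
    # Quantize so that y_fp8.to(torch.float32) * scale[:, None] recovers y.
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale

This matches the dequantization used by the test, y = y_fp8.to(torch.float32) * y_scale[:, None].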
Trying example: test_silu_mul_quant(
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[test body and traceback identical to the first example; fails in _fbgemm_silu_mul_quant with the same fp8e4nv CompilationError]
Trying example: test_silu_mul_quant(
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
[test body and traceback identical to the first example, minus the torch._dynamo frame since compiled=False; same fp8e4nv CompilationError in _fbgemm_silu_mul_quant]
Trying example: test_silu_mul_quant(
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[test body and traceback identical to the first example; same fp8e4nv CompilationError in _fbgemm_silu_mul_quant]
Trying example: test_silu_mul_quant(
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[test body and traceback identical to the first example; same fp8e4nv CompilationError in _fbgemm_silu_mul_quant]
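As the repeated examples show, Hypothesis keeps drawing fresh parameter combinations and every one fails the same way. When triaging a failure like this, it can help to pin one known-bad combination so it always runs first; a sketch using Hypothesis's @example decorator with one failing combination from this log (standalone function rather than the test class's method, for brevity):

from hypothesis import example, given, settings, strategies as st

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
# Pin the first failing combination seen in this log so it is always replayed.
@example(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
@settings(deadline=None)
def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled):
    ...  # same body as the test above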
Trying example: test_silu_mul_quant(
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[test body and traceback identical to the first example, minus the torch._dynamo frame since compiled=False; same fp8e4nv CompilationError in _fbgemm_silu_mul_quant]
Trying example: test_silu_mul_quant(
    T=4096,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
[test body and traceback identical to the first example, minus the torch._dynamo frame since compiled=False; same fp8e4nv CompilationError in _fbgemm_silu_mul_quant]
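The contiguous flag sampled by these examples matters because the test slices x0 and x1 out of the columns of a single [T, 2 * D] buffer, and column slices are non-contiguous views; the .contiguous() branch copies them into dense storage. A small self-contained illustration of that difference:

import torch

x = torch.randn(4, 8)        # stands in for the [T, 2 * D] activation buffer
x0, x1 = x[:, :4], x[:, 4:]  # column slices share x's storage
print(x0.is_contiguous(), x1.is_contiguous())  # False False: strides skip columns
print(x0.contiguous().is_contiguous())         # True: copied into dense storage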
Trying example: test_silu_mul_quant(
    T=16384,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
[test body and traceback identical to the first example; same fp8e4nv CompilationError in _fbgemm_silu_mul_quant]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9138619Z 2025-05-07T20:32:44.9139034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9139539Z 2025-05-07T20:32:44.9139642Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9140052Z self=, 2025-05-07T20:32:44.9140500Z T=4096, 2025-05-07T20:32:44.9140687Z D=5120, 2025-05-07T20:32:44.9140878Z scale_ub=None, 2025-05-07T20:32:44.9141090Z contiguous=False, 2025-05-07T20:32:44.9141310Z compiled=True, 2025-05-07T20:32:44.9141511Z ) 2025-05-07T20:32:44.9141825Z self = 2025-05-07T20:32:44.9142318Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.9142588Z 2025-05-07T20:32:44.9142668Z @given( 2025-05-07T20:32:44.9142898Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9143210Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9143512Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9143838Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9144166Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9144442Z ) 2025-05-07T20:32:44.9144789Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9145231Z def test_silu_mul_quant( 2025-05-07T20:32:44.9145470Z self, 2025-05-07T20:32:44.9145658Z T: int, 2025-05-07T20:32:44.9145851Z D: int, 2025-05-07T20:32:44.9146069Z scale_ub: Optional[float], 2025-05-07T20:32:44.9146332Z contiguous: bool, 2025-05-07T20:32:44.9146570Z compiled: bool, 2025-05-07T20:32:44.9146791Z ) -> None: 2025-05-07T20:32:44.9147046Z torch.manual_seed(2025) 2025-05-07T20:32:44.9147292Z 2025-05-07T20:32:44.9147565Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9147902Z 2025-05-07T20:32:44.9148123Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9148428Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9148729Z x = x_sign * x_clamp 2025-05-07T20:32:44.9148968Z x0 = x[:, :D] 2025-05-07T20:32:44.9149187Z x1 = x[:, D:] 2025-05-07T20:32:44.9149393Z 2025-05-07T20:32:44.9149577Z if contiguous: 2025-05-07T20:32:44.9149850Z x0 = x0.contiguous() 2025-05-07T20:32:44.9150101Z x1 = x1.contiguous() 2025-05-07T20:32:44.9150338Z 2025-05-07T20:32:44.9150529Z if scale_ub is not None: 2025-05-07T20:32:44.9150798Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9151127Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9151434Z ) 2025-05-07T20:32:44.9151630Z else: 2025-05-07T20:32:44.9151879Z scale_ub_tensor = None 2025-05-07T20:32:44.9152126Z 2025-05-07T20:32:44.9152354Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9152661Z op = silu_mul_quant 2025-05-07T20:32:44.9152909Z if compiled: 2025-05-07T20:32:44.9153154Z op = torch.compile(op) 2025-05-07T20:32:44.9153442Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9153713Z 2025-05-07T20:32:44.9153901Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9154067Z 2025-05-07T20:32:44.9154163Z moe/activation_test.py:117: 2025-05-07T20:32:44.9154455Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9154783Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9155058Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9155618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9156178Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9139642Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.9140052Z     self=<...>,
2025-05-07T20:32:44.9140500Z     T=4096,
2025-05-07T20:32:44.9140687Z     D=5120,
2025-05-07T20:32:44.9140878Z     scale_ub=None,
2025-05-07T20:32:44.9141090Z     contiguous=False,
2025-05-07T20:32:44.9141310Z     compiled=True,
2025-05-07T20:32:44.9141511Z )
2025-05-07T20:32:44.9141825Z self = <...>
2025-05-07T20:32:44.9142318Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:44.9142588Z 
2025-05-07T20:32:44.9142668Z     @given(
2025-05-07T20:32:44.9142898Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:44.9143210Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:44.9143512Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:44.9143838Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:44.9144166Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:44.9144442Z     )
2025-05-07T20:32:44.9144789Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:44.9145231Z     def test_silu_mul_quant(
2025-05-07T20:32:44.9145470Z         self,
2025-05-07T20:32:44.9145658Z         T: int,
2025-05-07T20:32:44.9145851Z         D: int,
2025-05-07T20:32:44.9146069Z         scale_ub: Optional[float],
2025-05-07T20:32:44.9146332Z         contiguous: bool,
2025-05-07T20:32:44.9146570Z         compiled: bool,
2025-05-07T20:32:44.9146791Z     ) -> None:
2025-05-07T20:32:44.9147046Z         torch.manual_seed(2025)
2025-05-07T20:32:44.9147292Z 
2025-05-07T20:32:44.9147565Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:44.9147902Z 
2025-05-07T20:32:44.9148123Z         x_sign = torch.sign(x)
2025-05-07T20:32:44.9148428Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:44.9148729Z         x = x_sign * x_clamp
2025-05-07T20:32:44.9148968Z         x0 = x[:, :D]
2025-05-07T20:32:44.9149187Z         x1 = x[:, D:]
2025-05-07T20:32:44.9149393Z 
2025-05-07T20:32:44.9149577Z         if contiguous:
2025-05-07T20:32:44.9149850Z             x0 = x0.contiguous()
2025-05-07T20:32:44.9150101Z             x1 = x1.contiguous()
2025-05-07T20:32:44.9150338Z 
2025-05-07T20:32:44.9150529Z         if scale_ub is not None:
2025-05-07T20:32:44.9150798Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:44.9151127Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:44.9151434Z             )
2025-05-07T20:32:44.9151630Z         else:
2025-05-07T20:32:44.9151879Z             scale_ub_tensor = None
2025-05-07T20:32:44.9152126Z 
2025-05-07T20:32:44.9152354Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:44.9152661Z             op = silu_mul_quant
2025-05-07T20:32:44.9152909Z             if compiled:
2025-05-07T20:32:44.9153154Z                 op = torch.compile(op)
2025-05-07T20:32:44.9153442Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:44.9153713Z 
2025-05-07T20:32:44.9153901Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:44.9154067Z 
2025-05-07T20:32:44.9154163Z moe/activation_test.py:117: 
2025-05-07T20:32:44.9154455Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:44.9154783Z moe/activation_test.py:115: in fn
2025-05-07T20:32:44.9155058Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:44.9155618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:44.9156178Z     return fn(*args, **kwargs)
2025-05-07T20:32:44.9156838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:44.9157522Z     _fbgemm_silu_mul_quant[grid](
[... same Triton frames as above ...]
2025-05-07T20:32:44.9168813Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.9169164Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:44.9169426Z E       ^
2025-05-07T20:32:44.9169885Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.9170383Z 
2025-05-07T20:32:44.9170799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
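For reference, the op under test fuses SiLU, an elementwise multiply, and rowwise FP8 quantization. A minimal eager sketch of the semantics the test exercises; the exact scaling rule (per-row absmax, `scale_ub` capping, clamping) is an assumption for illustration, not FBGEMM's actual kernel:

```python
from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Fused-op semantics: y = silu(x0) * x1, computed in fp32 for accuracy.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # Rowwise scale from the per-row absmax, optionally capped by scale_ub.
    row_max = y.abs().amax(dim=-1, keepdim=True)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = (row_max / FP8_MAX).clamp(min=1e-12)
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale
```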
[... the identical test body, traceback, and CompilationError repeat for each of the following examples; only the drawn parameters differ ...]
    Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
    Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
    Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
    Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
    Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
    Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
    Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
    Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
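Any single draw above reproduces the failure without Hypothesis. A standalone sketch, assuming `silu_mul_quant` is importable from the module whose file path appears in the traceback (the import path is inferred, not confirmed by the log); the final draw of this run follows below:

```python
import torch

# Import path inferred from .../fbgemm_gpu/experimental/gen_ai/moe/activation.py above.
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 2048, 7168  # one failing draw from the list above
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x0, x1 = x[:, :D], x[:, D:]

# Raises triton.compiler.errors.CompilationError on pre-sm_89 GPUs such as
# this runner's A10G; expected to succeed on fp8e4nv-capable hardware.
y_fp8, y_scale = silu_mul_quant(x0, x1, None)
```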
2025-05-07T20:32:44.9393935Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.9394157Z     self=<...>,
2025-05-07T20:32:44.9394241Z     T=16384,
2025-05-07T20:32:44.9394323Z     D=5120,
2025-05-07T20:32:44.9394407Z     scale_ub=None,
2025-05-07T20:32:44.9394500Z     contiguous=False,
2025-05-07T20:32:44.9394585Z     compiled=True,
2025-05-07T20:32:44.9394660Z )
[... same test body, traceback, and Triton frames as above ...]
2025-05-07T20:32:44.9406292Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.9406395Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:44.9406474Z E       ^
2025-05-07T20:32:44.9406828Z E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9406838Z 2025-05-07T20:32:44.9407248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9407253Z 2025-05-07T20:32:44.9407355Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9407625Z self=, 2025-05-07T20:32:44.9407705Z T=2048, 2025-05-07T20:32:44.9407860Z D=5120, 2025-05-07T20:32:44.9407951Z scale_ub=None, 2025-05-07T20:32:44.9408038Z contiguous=False, 2025-05-07T20:32:44.9408125Z compiled=True, 2025-05-07T20:32:44.9408199Z ) 2025-05-07T20:32:44.9408444Z self = 2025-05-07T20:32:44.9408635Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.9408640Z 2025-05-07T20:32:44.9408717Z @given( 2025-05-07T20:32:44.9408838Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9409000Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9409112Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9409227Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9409341Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9409415Z ) 2025-05-07T20:32:44.9409658Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9409810Z def test_silu_mul_quant( 2025-05-07T20:32:44.9409889Z self, 2025-05-07T20:32:44.9409966Z T: int, 2025-05-07T20:32:44.9410043Z D: int, 2025-05-07T20:32:44.9410142Z scale_ub: Optional[float], 2025-05-07T20:32:44.9410230Z contiguous: bool, 2025-05-07T20:32:44.9410313Z compiled: bool, 2025-05-07T20:32:44.9410391Z ) -> None: 2025-05-07T20:32:44.9410487Z torch.manual_seed(2025) 2025-05-07T20:32:44.9410562Z 2025-05-07T20:32:44.9410730Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9410807Z 2025-05-07T20:32:44.9410897Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9411025Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9411113Z x = x_sign * x_clamp 2025-05-07T20:32:44.9411192Z x0 = x[:, :D] 2025-05-07T20:32:44.9411273Z x1 = x[:, D:] 2025-05-07T20:32:44.9411345Z 2025-05-07T20:32:44.9411428Z if contiguous: 2025-05-07T20:32:44.9411526Z x0 = x0.contiguous() 2025-05-07T20:32:44.9411614Z x1 = x1.contiguous() 2025-05-07T20:32:44.9411686Z 2025-05-07T20:32:44.9411779Z if scale_ub is not None: 2025-05-07T20:32:44.9411883Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9412016Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9412096Z ) 2025-05-07T20:32:44.9412172Z else: 2025-05-07T20:32:44.9412330Z scale_ub_tensor = None 2025-05-07T20:32:44.9412411Z 2025-05-07T20:32:44.9412543Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9412633Z op = silu_mul_quant 2025-05-07T20:32:44.9412719Z if compiled: 2025-05-07T20:32:44.9412819Z op = torch.compile(op) 2025-05-07T20:32:44.9412924Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9412996Z 2025-05-07T20:32:44.9413085Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9413092Z 2025-05-07T20:32:44.9413193Z moe/activation_test.py:117: 2025-05-07T20:32:44.9413322Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9413421Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9413523Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9413888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9413984Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9414477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9414576Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9414936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9415153Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9415539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9415634Z kernel = self.compile( 2025-05-07T20:32:44.9416010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9416186Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9416310Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9416318Z 2025-05-07T20:32:44.9416584Z self = 2025-05-07T20:32:44.9417363Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9417905Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e4f7240>} 2025-05-07T20:32:44.9418699Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9418886Z context = 2025-05-07T20:32:44.9418891Z 2025-05-07T20:32:44.9419059Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9419322Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9419428Z module_map=module_map) 2025-05-07T20:32:44.9419595Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9419694Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9419771Z E ^ 2025-05-07T20:32:44.9420130Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9420135Z 2025-05-07T20:32:44.9420547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9420552Z 2025-05-07T20:32:44.9420659Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9420876Z self=, 2025-05-07T20:32:44.9420999Z T=2048, 2025-05-07T20:32:44.9421081Z D=5120, 2025-05-07T20:32:44.9421168Z scale_ub=1200.0, 2025-05-07T20:32:44.9421254Z contiguous=False, 2025-05-07T20:32:44.9421337Z compiled=True, 2025-05-07T20:32:44.9421411Z ) 2025-05-07T20:32:44.9421631Z self = 2025-05-07T20:32:44.9421803Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.9421808Z 2025-05-07T20:32:44.9421885Z @given( 2025-05-07T20:32:44.9422011Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9422108Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9422220Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9422336Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9422448Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9422522Z ) 2025-05-07T20:32:44.9422770Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9422867Z def test_silu_mul_quant( 2025-05-07T20:32:44.9422944Z self, 2025-05-07T20:32:44.9423021Z T: int, 2025-05-07T20:32:44.9423096Z D: int, 2025-05-07T20:32:44.9423196Z scale_ub: Optional[float], 2025-05-07T20:32:44.9423283Z contiguous: bool, 2025-05-07T20:32:44.9423367Z compiled: bool, 2025-05-07T20:32:44.9423450Z ) -> None: 2025-05-07T20:32:44.9423544Z torch.manual_seed(2025) 2025-05-07T20:32:44.9423661Z 2025-05-07T20:32:44.9423836Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9423911Z 2025-05-07T20:32:44.9423998Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9424125Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9424213Z x = x_sign * x_clamp 2025-05-07T20:32:44.9424295Z x0 = x[:, :D] 2025-05-07T20:32:44.9424373Z x1 = x[:, D:] 2025-05-07T20:32:44.9424448Z 2025-05-07T20:32:44.9424536Z if contiguous: 2025-05-07T20:32:44.9424665Z x0 = x0.contiguous() 2025-05-07T20:32:44.9424754Z x1 = x1.contiguous() 2025-05-07T20:32:44.9424829Z 2025-05-07T20:32:44.9424918Z if scale_ub is not None: 2025-05-07T20:32:44.9425021Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9425157Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9425231Z ) 2025-05-07T20:32:44.9425305Z else: 2025-05-07T20:32:44.9425449Z scale_ub_tensor = None 2025-05-07T20:32:44.9425525Z 2025-05-07T20:32:44.9425655Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9425752Z op = silu_mul_quant 2025-05-07T20:32:44.9425836Z if compiled: 2025-05-07T20:32:44.9425937Z op = torch.compile(op) 2025-05-07T20:32:44.9426041Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9426113Z 2025-05-07T20:32:44.9426211Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9426216Z 2025-05-07T20:32:44.9426315Z moe/activation_test.py:117: 2025-05-07T20:32:44.9426446Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9426547Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9426643Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9427010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9427107Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9427604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9427703Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9428055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9428276Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9428658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9428757Z kernel = self.compile( 2025-05-07T20:32:44.9429138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9429309Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9429441Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9429448Z 2025-05-07T20:32:44.9429649Z self = 2025-05-07T20:32:44.9430415Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9430916Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e50c720>} 2025-05-07T20:32:44.9431664Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9431856Z context = 2025-05-07T20:32:44.9431861Z 2025-05-07T20:32:44.9432066Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9432329Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9432434Z module_map=module_map) 2025-05-07T20:32:44.9432595Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9432695Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9432776Z E ^ 2025-05-07T20:32:44.9433127Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9433172Z 2025-05-07T20:32:44.9433586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9433591Z 2025-05-07T20:32:44.9433693Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9433915Z self=, 2025-05-07T20:32:44.9434032Z T=4096, 2025-05-07T20:32:44.9434109Z D=5120, 2025-05-07T20:32:44.9434197Z scale_ub=1200.0, 2025-05-07T20:32:44.9434281Z contiguous=True, 2025-05-07T20:32:44.9434362Z compiled=True, 2025-05-07T20:32:44.9434441Z ) 2025-05-07T20:32:44.9434655Z self = 2025-05-07T20:32:44.9434823Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9434834Z 2025-05-07T20:32:44.9434907Z @given( 2025-05-07T20:32:44.9435028Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9435127Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9435239Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9435354Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9435469Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9435545Z ) 2025-05-07T20:32:44.9435791Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9435888Z def test_silu_mul_quant( 2025-05-07T20:32:44.9435964Z self, 2025-05-07T20:32:44.9436038Z T: int, 2025-05-07T20:32:44.9436118Z D: int, 2025-05-07T20:32:44.9436214Z scale_ub: Optional[float], 2025-05-07T20:32:44.9436304Z contiguous: bool, 2025-05-07T20:32:44.9436388Z compiled: bool, 2025-05-07T20:32:44.9436464Z ) -> None: 2025-05-07T20:32:44.9436610Z torch.manual_seed(2025) 2025-05-07T20:32:44.9436685Z 2025-05-07T20:32:44.9436852Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9436929Z 2025-05-07T20:32:44.9437019Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9437140Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9437232Z x = x_sign * x_clamp 2025-05-07T20:32:44.9437311Z x0 = x[:, :D] 2025-05-07T20:32:44.9437391Z x1 = x[:, D:] 2025-05-07T20:32:44.9437466Z 2025-05-07T20:32:44.9437552Z if contiguous: 2025-05-07T20:32:44.9437641Z x0 = x0.contiguous() 2025-05-07T20:32:44.9437734Z x1 = x1.contiguous() 2025-05-07T20:32:44.9437805Z 2025-05-07T20:32:44.9437895Z if scale_ub is not None: 2025-05-07T20:32:44.9438001Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9438143Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9438241Z ) 2025-05-07T20:32:44.9438324Z else: 2025-05-07T20:32:44.9438437Z scale_ub_tensor = None 2025-05-07T20:32:44.9438513Z 2025-05-07T20:32:44.9438640Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9438727Z op = silu_mul_quant 2025-05-07T20:32:44.9438812Z if compiled: 2025-05-07T20:32:44.9438913Z op = torch.compile(op) 2025-05-07T20:32:44.9439018Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9439093Z 2025-05-07T20:32:44.9439227Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9439232Z 2025-05-07T20:32:44.9439336Z moe/activation_test.py:117: 2025-05-07T20:32:44.9439465Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9439565Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9439667Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9440030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9440124Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9440658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9440753Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9441108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9441371Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9441708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9441803Z kernel = self.compile( 2025-05-07T20:32:44.9442180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9442356Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9442485Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9442492Z 2025-05-07T20:32:44.9442693Z self = 2025-05-07T20:32:44.9443463Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9443967Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e50d260>} 2025-05-07T20:32:44.9444710Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9444898Z context = 2025-05-07T20:32:44.9444944Z 2025-05-07T20:32:44.9445109Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9445371Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9445478Z module_map=module_map) 2025-05-07T20:32:44.9445639Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9445737Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9445816Z E ^ 2025-05-07T20:32:44.9446170Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9446175Z 2025-05-07T20:32:44.9446585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9446589Z 2025-05-07T20:32:44.9446693Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9446910Z self=, 2025-05-07T20:32:44.9446994Z T=128, 2025-05-07T20:32:44.9447069Z D=5120, 2025-05-07T20:32:44.9447151Z scale_ub=1200.0, 2025-05-07T20:32:44.9447236Z contiguous=False, 2025-05-07T20:32:44.9447320Z compiled=True, 2025-05-07T20:32:44.9447393Z ) 2025-05-07T20:32:44.9447662Z self = 2025-05-07T20:32:44.9447836Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.9447912Z 2025-05-07T20:32:44.9447994Z @given( 2025-05-07T20:32:44.9448114Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9448211Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9448346Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9448476Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9448599Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9448678Z ) 2025-05-07T20:32:44.9448922Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9449070Z def test_silu_mul_quant( 2025-05-07T20:32:44.9449145Z self, 2025-05-07T20:32:44.9449223Z T: int, 2025-05-07T20:32:44.9449300Z D: int, 2025-05-07T20:32:44.9449397Z scale_ub: Optional[float], 2025-05-07T20:32:44.9449488Z contiguous: bool, 2025-05-07T20:32:44.9449574Z compiled: bool, 2025-05-07T20:32:44.9449654Z ) -> None: 2025-05-07T20:32:44.9449793Z torch.manual_seed(2025) 2025-05-07T20:32:44.9449869Z 2025-05-07T20:32:44.9450039Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9450113Z 2025-05-07T20:32:44.9450201Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9450328Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9450415Z x = x_sign * x_clamp 2025-05-07T20:32:44.9450494Z x0 = x[:, :D] 2025-05-07T20:32:44.9450579Z x1 = x[:, D:] 2025-05-07T20:32:44.9450653Z 2025-05-07T20:32:44.9450735Z if contiguous: 2025-05-07T20:32:44.9450828Z x0 = x0.contiguous() 2025-05-07T20:32:44.9450918Z x1 = x1.contiguous() 2025-05-07T20:32:44.9450990Z 2025-05-07T20:32:44.9451083Z if scale_ub is not None: 2025-05-07T20:32:44.9451186Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9451322Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9451399Z ) 2025-05-07T20:32:44.9451478Z else: 2025-05-07T20:32:44.9451573Z scale_ub_tensor = None 2025-05-07T20:32:44.9451646Z 2025-05-07T20:32:44.9451775Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9451870Z op = silu_mul_quant 2025-05-07T20:32:44.9451953Z if compiled: 2025-05-07T20:32:44.9452051Z op = torch.compile(op) 2025-05-07T20:32:44.9452156Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9452273Z 2025-05-07T20:32:44.9452366Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9452374Z 2025-05-07T20:32:44.9452469Z moe/activation_test.py:117: 2025-05-07T20:32:44.9452596Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9452697Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9452794Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9453161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9453256Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9453744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9453840Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9454196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9454418Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9454758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9454850Z kernel = self.compile( 2025-05-07T20:32:44.9455225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9455399Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9455571Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9455576Z 2025-05-07T20:32:44.9455782Z self = 2025-05-07T20:32:44.9456548Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9457049Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e50e480>} 2025-05-07T20:32:44.9457830Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9458020Z context = 2025-05-07T20:32:44.9458062Z 2025-05-07T20:32:44.9458230Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9458490Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9458594Z module_map=module_map) 2025-05-07T20:32:44.9458756Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9458855Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9458936Z E ^ 2025-05-07T20:32:44.9459288Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9459293Z 2025-05-07T20:32:44.9459701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9459705Z 2025-05-07T20:32:44.9459810Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9460032Z self=, 2025-05-07T20:32:44.9460110Z T=16384, 2025-05-07T20:32:44.9460188Z D=7168, 2025-05-07T20:32:44.9460269Z scale_ub=1200.0, 2025-05-07T20:32:44.9460355Z contiguous=True, 2025-05-07T20:32:44.9460437Z compiled=True, 2025-05-07T20:32:44.9460509Z ) 2025-05-07T20:32:44.9460729Z self = 2025-05-07T20:32:44.9460905Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9460956Z 2025-05-07T20:32:44.9461034Z @given( 2025-05-07T20:32:44.9461154Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9461252Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9461366Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9461481Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9461591Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9461667Z ) 2025-05-07T20:32:44.9461912Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9462005Z def test_silu_mul_quant( 2025-05-07T20:32:44.9462087Z self, 2025-05-07T20:32:44.9462164Z T: int, 2025-05-07T20:32:44.9462238Z D: int, 2025-05-07T20:32:44.9462337Z scale_ub: Optional[float], 2025-05-07T20:32:44.9462426Z contiguous: bool, 2025-05-07T20:32:44.9462511Z compiled: bool, 2025-05-07T20:32:44.9462594Z ) -> None: 2025-05-07T20:32:44.9462689Z torch.manual_seed(2025) 2025-05-07T20:32:44.9462764Z 2025-05-07T20:32:44.9462930Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9463005Z 2025-05-07T20:32:44.9463099Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9463221Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9463309Z x = x_sign * x_clamp 2025-05-07T20:32:44.9463393Z x0 = x[:, :D] 2025-05-07T20:32:44.9463516Z x1 = x[:, D:] 2025-05-07T20:32:44.9463591Z 2025-05-07T20:32:44.9463677Z if contiguous: 2025-05-07T20:32:44.9463769Z x0 = x0.contiguous() 2025-05-07T20:32:44.9463856Z x1 = x1.contiguous() 2025-05-07T20:32:44.9463931Z 2025-05-07T20:32:44.9464022Z if scale_ub is not None: 2025-05-07T20:32:44.9464125Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9464261Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9464338Z ) 2025-05-07T20:32:44.9464458Z else: 2025-05-07T20:32:44.9464552Z scale_ub_tensor = None 2025-05-07T20:32:44.9464626Z 2025-05-07T20:32:44.9464761Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9464850Z op = silu_mul_quant 2025-05-07T20:32:44.9464934Z if compiled: 2025-05-07T20:32:44.9465034Z op = torch.compile(op) 2025-05-07T20:32:44.9465144Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9465256Z 2025-05-07T20:32:44.9465348Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9465352Z 2025-05-07T20:32:44.9465446Z moe/activation_test.py:117: 2025-05-07T20:32:44.9465576Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9465675Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9465773Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9466138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9466236Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9466724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9466823Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9467175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9467399Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9467736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9467829Z kernel = self.compile( 2025-05-07T20:32:44.9468231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9468503Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9468632Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9468641Z 2025-05-07T20:32:44.9468842Z self = 2025-05-07T20:32:44.9469614Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9470113Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e50fd80>} 2025-05-07T20:32:44.9470852Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9471046Z context = 2025-05-07T20:32:44.9471053Z 2025-05-07T20:32:44.9471214Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9471473Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9471579Z module_map=module_map) 2025-05-07T20:32:44.9471737Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9471876Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9471958Z E ^ 2025-05-07T20:32:44.9472309Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9472314Z 2025-05-07T20:32:44.9472727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9472732Z 2025-05-07T20:32:44.9472835Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9473055Z self=, 2025-05-07T20:32:44.9473174Z T=16384, 2025-05-07T20:32:44.9473250Z D=5120, 2025-05-07T20:32:44.9473336Z scale_ub=1200.0, 2025-05-07T20:32:44.9473421Z contiguous=True, 2025-05-07T20:32:44.9473503Z compiled=False, 2025-05-07T20:32:44.9473580Z ) 2025-05-07T20:32:44.9473795Z self = 2025-05-07T20:32:44.9474011Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9474016Z 2025-05-07T20:32:44.9474092Z @given( 2025-05-07T20:32:44.9474209Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9474304Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9474422Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9474538Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9474654Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9474731Z ) 2025-05-07T20:32:44.9474971Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9475069Z def test_silu_mul_quant( 2025-05-07T20:32:44.9475144Z self, 2025-05-07T20:32:44.9475221Z T: int, 2025-05-07T20:32:44.9475297Z D: int, 2025-05-07T20:32:44.9475394Z scale_ub: Optional[float], 2025-05-07T20:32:44.9475482Z contiguous: bool, 2025-05-07T20:32:44.9475569Z compiled: bool, 2025-05-07T20:32:44.9475650Z ) -> None: 2025-05-07T20:32:44.9475741Z torch.manual_seed(2025) 2025-05-07T20:32:44.9475816Z 2025-05-07T20:32:44.9475981Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9476057Z 2025-05-07T20:32:44.9476148Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9476270Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9476360Z x = x_sign * x_clamp 2025-05-07T20:32:44.9476509Z x0 = x[:, :D] 2025-05-07T20:32:44.9476590Z x1 = x[:, D:] 2025-05-07T20:32:44.9476664Z 2025-05-07T20:32:44.9476748Z if contiguous: 2025-05-07T20:32:44.9476841Z x0 = x0.contiguous() 2025-05-07T20:32:44.9476930Z x1 = x1.contiguous() 2025-05-07T20:32:44.9477002Z 2025-05-07T20:32:44.9477095Z if scale_ub is not None: 2025-05-07T20:32:44.9477203Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9477343Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9477419Z ) 2025-05-07T20:32:44.9477495Z else: 2025-05-07T20:32:44.9477587Z scale_ub_tensor = None 2025-05-07T20:32:44.9477663Z 2025-05-07T20:32:44.9477791Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9477883Z op = silu_mul_quant 2025-05-07T20:32:44.9477966Z if compiled: 2025-05-07T20:32:44.9478069Z op = torch.compile(op) 2025-05-07T20:32:44.9478181Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9478259Z 2025-05-07T20:32:44.9478370Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9478374Z 2025-05-07T20:32:44.9478483Z moe/activation_test.py:117: 2025-05-07T20:32:44.9478624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9478722Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9478822Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9479366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:44.9479465Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9479819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9480036Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9480376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9480509Z kernel = self.compile( 2025-05-07T20:32:44.9480891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9481062Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9481186Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9481193Z 2025-05-07T20:32:44.9481438Z self = 2025-05-07T20:32:44.9482206Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9482705Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e3f8cc0>} 2025-05-07T20:32:44.9483451Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9483640Z context = 2025-05-07T20:32:44.9483645Z 2025-05-07T20:32:44.9483812Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9484073Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9484182Z module_map=module_map) 2025-05-07T20:32:44.9484341Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9484438Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9484515Z E ^ 2025-05-07T20:32:44.9484862Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9484914Z 2025-05-07T20:32:44.9485324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9485331Z 2025-05-07T20:32:44.9485433Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9485653Z self=, 2025-05-07T20:32:44.9485735Z T=1, 2025-05-07T20:32:44.9485816Z D=7168, 2025-05-07T20:32:44.9485901Z scale_ub=1200.0, 2025-05-07T20:32:44.9485988Z contiguous=False, 2025-05-07T20:32:44.9486071Z compiled=False, 2025-05-07T20:32:44.9486145Z ) 2025-05-07T20:32:44.9486363Z self = 2025-05-07T20:32:44.9486527Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.9486531Z 2025-05-07T20:32:44.9486608Z @given( 2025-05-07T20:32:44.9486730Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9486829Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9486947Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9487060Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9487170Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9487251Z ) 2025-05-07T20:32:44.9487579Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9487678Z def test_silu_mul_quant( 2025-05-07T20:32:44.9487754Z self, 2025-05-07T20:32:44.9487832Z T: int, 2025-05-07T20:32:44.9487908Z D: int, 2025-05-07T20:32:44.9488007Z scale_ub: Optional[float], 2025-05-07T20:32:44.9488096Z contiguous: bool, 2025-05-07T20:32:44.9488181Z compiled: bool, 2025-05-07T20:32:44.9488258Z ) -> None: 2025-05-07T20:32:44.9488350Z torch.manual_seed(2025) 2025-05-07T20:32:44.9488429Z 2025-05-07T20:32:44.9488597Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9488718Z 2025-05-07T20:32:44.9488811Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9488937Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9489024Z x = x_sign * x_clamp 2025-05-07T20:32:44.9489107Z x0 = x[:, :D] 2025-05-07T20:32:44.9489192Z x1 = x[:, D:] 2025-05-07T20:32:44.9489264Z 2025-05-07T20:32:44.9489352Z if contiguous: 2025-05-07T20:32:44.9489480Z x0 = x0.contiguous() 2025-05-07T20:32:44.9489573Z x1 = x1.contiguous() 2025-05-07T20:32:44.9489645Z 2025-05-07T20:32:44.9489735Z if scale_ub is not None: 2025-05-07T20:32:44.9489841Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9489973Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9490048Z ) 2025-05-07T20:32:44.9490125Z else: 2025-05-07T20:32:44.9490224Z scale_ub_tensor = None 2025-05-07T20:32:44.9490300Z 2025-05-07T20:32:44.9490434Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9490524Z op = silu_mul_quant 2025-05-07T20:32:44.9490607Z if compiled: 2025-05-07T20:32:44.9490708Z op = torch.compile(op) 2025-05-07T20:32:44.9490812Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9490887Z 2025-05-07T20:32:44.9490976Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9490983Z 2025-05-07T20:32:44.9491083Z moe/activation_test.py:117: 2025-05-07T20:32:44.9491213Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9491315Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9491412Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9491909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9492051Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9492411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9492630Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9492966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9493066Z kernel = self.compile( 2025-05-07T20:32:44.9493447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9493619Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9493748Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9493752Z 2025-05-07T20:32:44.9493954Z self = 2025-05-07T20:32:44.9494727Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9495226Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e3f9080>} 2025-05-07T20:32:44.9496011Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9496202Z context = 2025-05-07T20:32:44.9496206Z 2025-05-07T20:32:44.9496369Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9496631Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9496742Z module_map=module_map) 2025-05-07T20:32:44.9496943Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9497044Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9497122Z E ^ 2025-05-07T20:32:44.9497475Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9497479Z 2025-05-07T20:32:44.9497928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9497933Z 2025-05-07T20:32:44.9498037Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9498259Z self=, 2025-05-07T20:32:44.9498336Z T=4096, 2025-05-07T20:32:44.9498418Z D=7168, 2025-05-07T20:32:44.9498501Z scale_ub=1200.0, 2025-05-07T20:32:44.9498586Z contiguous=False, 2025-05-07T20:32:44.9498677Z compiled=True, 2025-05-07T20:32:44.9498751Z ) 2025-05-07T20:32:44.9498970Z self = 2025-05-07T20:32:44.9499144Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.9499148Z 2025-05-07T20:32:44.9499224Z @given( 2025-05-07T20:32:44.9499341Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9499442Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9499557Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9499674Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9499785Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9499858Z ) 2025-05-07T20:32:44.9500100Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9500194Z def test_silu_mul_quant( 2025-05-07T20:32:44.9500270Z self, 2025-05-07T20:32:44.9500392Z T: int, 2025-05-07T20:32:44.9500469Z D: int, 2025-05-07T20:32:44.9500568Z scale_ub: Optional[float], 2025-05-07T20:32:44.9500659Z contiguous: bool, 2025-05-07T20:32:44.9500743Z compiled: bool, 2025-05-07T20:32:44.9500820Z ) -> None: 2025-05-07T20:32:44.9500915Z torch.manual_seed(2025) 2025-05-07T20:32:44.9500987Z 2025-05-07T20:32:44.9501154Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9501229Z 2025-05-07T20:32:44.9501321Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9501448Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9501533Z x = x_sign * x_clamp 2025-05-07T20:32:44.9501613Z x0 = x[:, :D] 2025-05-07T20:32:44.9501696Z x1 = x[:, D:] 2025-05-07T20:32:44.9501769Z 2025-05-07T20:32:44.9501850Z if contiguous: 2025-05-07T20:32:44.9501943Z x0 = x0.contiguous() 2025-05-07T20:32:44.9502031Z x1 = x1.contiguous() 2025-05-07T20:32:44.9502106Z 2025-05-07T20:32:44.9502197Z if scale_ub is not None: 2025-05-07T20:32:44.9502305Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9502437Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9502514Z ) 2025-05-07T20:32:44.9502590Z else: 2025-05-07T20:32:44.9502685Z scale_ub_tensor = None 2025-05-07T20:32:44.9502755Z 2025-05-07T20:32:44.9502881Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9503016Z op = silu_mul_quant 2025-05-07T20:32:44.9503101Z if compiled: 2025-05-07T20:32:44.9503199Z op = torch.compile(op) 2025-05-07T20:32:44.9503306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9503378Z 2025-05-07T20:32:44.9508780Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9508792Z 2025-05-07T20:32:44.9508908Z moe/activation_test.py:117: 2025-05-07T20:32:44.9509040Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9509246Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9509348Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9509721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9509821Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9510316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9510487Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9510843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9511065Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9511404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9511505Z kernel = self.compile( 2025-05-07T20:32:44.9511886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9512064Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9512190Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9512195Z 2025-05-07T20:32:44.9512402Z self = 2025-05-07T20:32:44.9513180Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9513682Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e3fb060>} 2025-05-07T20:32:44.9514496Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9514687Z context = 2025-05-07T20:32:44.9514692Z 2025-05-07T20:32:44.9514862Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9515124Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9515238Z module_map=module_map) 2025-05-07T20:32:44.9515397Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9515493Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9515573Z E ^ 2025-05-07T20:32:44.9515924Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9515932Z 2025-05-07T20:32:44.9516343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9516350Z 2025-05-07T20:32:44.9516456Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9516674Z self=, 2025-05-07T20:32:44.9516756Z T=128, 2025-05-07T20:32:44.9516834Z D=7168, 2025-05-07T20:32:44.9516918Z scale_ub=1200.0, 2025-05-07T20:32:44.9517005Z contiguous=False, 2025-05-07T20:32:44.9517149Z compiled=True, 2025-05-07T20:32:44.9517224Z ) 2025-05-07T20:32:44.9517442Z self = 2025-05-07T20:32:44.9517611Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.9517616Z 2025-05-07T20:32:44.9517692Z @given( 2025-05-07T20:32:44.9517813Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9517912Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9518032Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9518191Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9518302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9518377Z ) 2025-05-07T20:32:44.9518619Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9518711Z def test_silu_mul_quant( 2025-05-07T20:32:44.9518791Z self, 2025-05-07T20:32:44.9518873Z T: int, 2025-05-07T20:32:44.9519011Z D: int, 2025-05-07T20:32:44.9519115Z scale_ub: Optional[float], 2025-05-07T20:32:44.9519204Z contiguous: bool, 2025-05-07T20:32:44.9519288Z compiled: bool, 2025-05-07T20:32:44.9519369Z ) -> None: 2025-05-07T20:32:44.9519461Z torch.manual_seed(2025) 2025-05-07T20:32:44.9519537Z 2025-05-07T20:32:44.9519705Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9519784Z 2025-05-07T20:32:44.9519877Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9520002Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9520096Z x = x_sign * x_clamp 2025-05-07T20:32:44.9520177Z x0 = x[:, :D] 2025-05-07T20:32:44.9520256Z x1 = x[:, D:] 2025-05-07T20:32:44.9520328Z 2025-05-07T20:32:44.9520410Z if contiguous: 2025-05-07T20:32:44.9520500Z x0 = x0.contiguous() 2025-05-07T20:32:44.9520593Z x1 = x1.contiguous() 2025-05-07T20:32:44.9520666Z 2025-05-07T20:32:44.9520756Z if scale_ub is not None: 2025-05-07T20:32:44.9520865Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9520999Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9521075Z ) 2025-05-07T20:32:44.9521152Z else: 2025-05-07T20:32:44.9521247Z scale_ub_tensor = None 2025-05-07T20:32:44.9521323Z 2025-05-07T20:32:44.9521451Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9521591Z op = silu_mul_quant 2025-05-07T20:32:44.9521678Z if compiled: 2025-05-07T20:32:44.9521777Z op = torch.compile(op) 2025-05-07T20:32:44.9521880Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9521956Z 2025-05-07T20:32:44.9522045Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9522050Z 2025-05-07T20:32:44.9522149Z moe/activation_test.py:117: 2025-05-07T20:32:44.9522283Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9522384Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9522485Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9522848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9522938Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9523429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9523532Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9523889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9524108Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9524445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9524584Z kernel = self.compile( 2025-05-07T20:32:44.9524963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9525135Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9525266Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9525270Z 2025-05-07T20:32:44.9525474Z self = 2025-05-07T20:32:44.9526290Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9526789Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e0c0360>} 2025-05-07T20:32:44.9527640Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9527831Z context = 2025-05-07T20:32:44.9527836Z 2025-05-07T20:32:44.9527999Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9528264Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9528370Z module_map=module_map) 2025-05-07T20:32:44.9528529Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9528631Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9528708Z E ^ 2025-05-07T20:32:44.9529061Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9529068Z 2025-05-07T20:32:44.9529478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9529483Z 2025-05-07T20:32:44.9529583Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9529806Z self=, 2025-05-07T20:32:44.9529883Z T=2048, 2025-05-07T20:32:44.9529962Z D=7168, 2025-05-07T20:32:44.9530043Z scale_ub=None, 2025-05-07T20:32:44.9530170Z contiguous=True, 2025-05-07T20:32:44.9530261Z compiled=True, 2025-05-07T20:32:44.9530335Z ) 2025-05-07T20:32:44.9530549Z self = 2025-05-07T20:32:44.9530719Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.9530724Z 2025-05-07T20:32:44.9530799Z @given( 2025-05-07T20:32:44.9530919Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9531023Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9531141Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9531259Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9531370Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9531444Z ) 2025-05-07T20:32:44.9531688Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9531782Z def test_silu_mul_quant( 2025-05-07T20:32:44.9531858Z self, 2025-05-07T20:32:44.9531935Z T: int, 2025-05-07T20:32:44.9532012Z D: int, 2025-05-07T20:32:44.9532110Z scale_ub: Optional[float], 2025-05-07T20:32:44.9532201Z contiguous: bool, 2025-05-07T20:32:44.9532284Z compiled: bool, 2025-05-07T20:32:44.9532359Z ) -> None: 2025-05-07T20:32:44.9532455Z torch.manual_seed(2025) 2025-05-07T20:32:44.9532528Z 2025-05-07T20:32:44.9532693Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9532813Z 2025-05-07T20:32:44.9532908Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9533034Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9533121Z x = x_sign * x_clamp 2025-05-07T20:32:44.9533200Z x0 = x[:, :D] 2025-05-07T20:32:44.9533281Z x1 = x[:, D:] 2025-05-07T20:32:44.9533354Z 2025-05-07T20:32:44.9533435Z if contiguous: 2025-05-07T20:32:44.9533528Z x0 = x0.contiguous() 2025-05-07T20:32:44.9533620Z x1 = x1.contiguous() 2025-05-07T20:32:44.9533734Z 2025-05-07T20:32:44.9533826Z if scale_ub is not None: 2025-05-07T20:32:44.9533931Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9534065Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9534144Z ) 2025-05-07T20:32:44.9534220Z else: 2025-05-07T20:32:44.9534315Z scale_ub_tensor = None 2025-05-07T20:32:44.9534389Z 2025-05-07T20:32:44.9534559Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9534652Z op = silu_mul_quant 2025-05-07T20:32:44.9534736Z if compiled: 2025-05-07T20:32:44.9534833Z op = torch.compile(op) 2025-05-07T20:32:44.9534940Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9535011Z 2025-05-07T20:32:44.9535099Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9535104Z 2025-05-07T20:32:44.9535203Z moe/activation_test.py:117: 2025-05-07T20:32:44.9535336Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9535440Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9535538Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9535904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9535998Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9536490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9536586Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9536943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9537161Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9537498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9537639Z kernel = self.compile( 2025-05-07T20:32:44.9538015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9538190Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9538313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9538318Z 2025-05-07T20:32:44.9538525Z self = 2025-05-07T20:32:44.9539300Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9539798Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e0c0ea0>} 2025-05-07T20:32:44.9540544Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9540732Z context = 2025-05-07T20:32:44.9540737Z 2025-05-07T20:32:44.9540900Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9541207Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9541314Z module_map=module_map) 2025-05-07T20:32:44.9541478Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9541575Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9541652Z E ^ 2025-05-07T20:32:44.9542006Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9542014Z 2025-05-07T20:32:44.9542464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9542469Z 2025-05-07T20:32:44.9542574Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9542790Z self=, 2025-05-07T20:32:44.9542866Z T=16384, 2025-05-07T20:32:44.9542945Z D=5120, 2025-05-07T20:32:44.9543031Z scale_ub=None, 2025-05-07T20:32:44.9543154Z contiguous=False, 2025-05-07T20:32:44.9543240Z compiled=False, 2025-05-07T20:32:44.9543314Z ) 2025-05-07T20:32:44.9543531Z self = 2025-05-07T20:32:44.9543707Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.9543711Z 2025-05-07T20:32:44.9543789Z @given( 2025-05-07T20:32:44.9543911Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9544012Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9544127Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9544244Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9544354Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9544429Z ) 2025-05-07T20:32:44.9544670Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9544763Z def test_silu_mul_quant( 2025-05-07T20:32:44.9544844Z self, 2025-05-07T20:32:44.9544919Z T: int, 2025-05-07T20:32:44.9544997Z D: int, 2025-05-07T20:32:44.9545097Z scale_ub: Optional[float], 2025-05-07T20:32:44.9545185Z contiguous: bool, 2025-05-07T20:32:44.9545268Z compiled: bool, 2025-05-07T20:32:44.9545347Z ) -> None: 2025-05-07T20:32:44.9545442Z torch.manual_seed(2025) 2025-05-07T20:32:44.9545515Z 2025-05-07T20:32:44.9545687Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9545808Z 2025-05-07T20:32:44.9545899Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9546023Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9547821Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
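The allocator hint named in the message above is an environment variable that the CUDA caching allocator reads when it is first initialized, so it has to be in the process environment before the first CUDA allocation. A minimal sketch, assuming a Python entrypoint; exporting the variable in the shell step that launches pytest would work just as well:

    import os

    # Must be set before torch initializes the CUDA caching allocator,
    # i.e. before the first CUDA allocation in this process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    x = torch.randn(1024, device="cuda")  # allocations now use expandable segments
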
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9547830Z 2025-05-07T20:32:44.9547946Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.9547950Z 2025-05-07T20:32:44.9548049Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9548301Z self=, 2025-05-07T20:32:44.9548399Z T=4096, 2025-05-07T20:32:44.9548477Z D=7168, 2025-05-07T20:32:44.9548563Z scale_ub=1200.0, 2025-05-07T20:32:44.9548645Z contiguous=True, 2025-05-07T20:32:44.9548726Z compiled=True, 2025-05-07T20:32:44.9548798Z ) 2025-05-07T20:32:44.9549013Z self = 2025-05-07T20:32:44.9549230Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9549235Z 2025-05-07T20:32:44.9549310Z @given( 2025-05-07T20:32:44.9549430Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9549528Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9549639Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9549753Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9549866Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9549943Z ) 2025-05-07T20:32:44.9550246Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9550341Z def test_silu_mul_quant( 2025-05-07T20:32:44.9550418Z self, 2025-05-07T20:32:44.9550496Z T: int, 2025-05-07T20:32:44.9550573Z D: int, 2025-05-07T20:32:44.9550669Z scale_ub: Optional[float], 2025-05-07T20:32:44.9550763Z contiguous: bool, 2025-05-07T20:32:44.9550854Z compiled: bool, 2025-05-07T20:32:44.9550972Z ) -> None: 2025-05-07T20:32:44.9551067Z torch.manual_seed(2025) 2025-05-07T20:32:44.9551139Z 2025-05-07T20:32:44.9551303Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9551378Z 2025-05-07T20:32:44.9551469Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9551594Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9553373Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9553386Z 2025-05-07T20:32:44.9553505Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.9553510Z 2025-05-07T20:32:44.9553611Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9553831Z self=, 2025-05-07T20:32:44.9553907Z T=16384, 2025-05-07T20:32:44.9553985Z D=7168, 2025-05-07T20:32:44.9554069Z scale_ub=None, 2025-05-07T20:32:44.9554198Z contiguous=False, 2025-05-07T20:32:44.9554281Z compiled=False, 2025-05-07T20:32:44.9554358Z ) 2025-05-07T20:32:44.9554570Z self = 2025-05-07T20:32:44.9554742Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.9554746Z 2025-05-07T20:32:44.9554826Z @given( 2025-05-07T20:32:44.9554943Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9555043Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9555162Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9555275Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9555390Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9555463Z ) 2025-05-07T20:32:44.9555705Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9555799Z def test_silu_mul_quant( 2025-05-07T20:32:44.9555875Z self, 2025-05-07T20:32:44.9555953Z T: int, 2025-05-07T20:32:44.9556032Z D: int, 2025-05-07T20:32:44.9556129Z scale_ub: Optional[float], 2025-05-07T20:32:44.9556220Z contiguous: bool, 2025-05-07T20:32:44.9556306Z compiled: bool, 2025-05-07T20:32:44.9556383Z ) -> None: 2025-05-07T20:32:44.9556479Z torch.manual_seed(2025) 2025-05-07T20:32:44.9556552Z 2025-05-07T20:32:44.9556715Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9558595Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9558639Z 2025-05-07T20:32:44.9558757Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9558761Z 2025-05-07T20:32:44.9558863Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9559080Z self=, 2025-05-07T20:32:44.9559157Z T=2048, 2025-05-07T20:32:44.9559233Z D=7168, 2025-05-07T20:32:44.9559315Z scale_ub=1200.0, 2025-05-07T20:32:44.9559402Z contiguous=True, 2025-05-07T20:32:44.9559524Z compiled=True, 2025-05-07T20:32:44.9559597Z ) 2025-05-07T20:32:44.9559810Z self = 2025-05-07T20:32:44.9559981Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9559986Z 2025-05-07T20:32:44.9560062Z @given( 2025-05-07T20:32:44.9560180Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9560281Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9560394Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9560513Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9560623Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9560693Z ) 2025-05-07T20:32:44.9560933Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9561026Z def test_silu_mul_quant( 2025-05-07T20:32:44.9561107Z self, 2025-05-07T20:32:44.9561183Z T: int, 2025-05-07T20:32:44.9561256Z D: int, 2025-05-07T20:32:44.9561354Z scale_ub: Optional[float], 2025-05-07T20:32:44.9561442Z contiguous: bool, 2025-05-07T20:32:44.9561526Z compiled: bool, 2025-05-07T20:32:44.9561605Z ) -> None: 2025-05-07T20:32:44.9561698Z torch.manual_seed(2025) 2025-05-07T20:32:44.9561771Z 2025-05-07T20:32:44.9561938Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9562056Z 2025-05-07T20:32:44.9562148Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9562272Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9564038Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9564045Z 2025-05-07T20:32:44.9564162Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.9564166Z 2025-05-07T20:32:44.9564265Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9564487Z self=, 2025-05-07T20:32:44.9564564Z T=2048, 2025-05-07T20:32:44.9564639Z D=7168, 2025-05-07T20:32:44.9564721Z scale_ub=None, 2025-05-07T20:32:44.9564802Z contiguous=True, 2025-05-07T20:32:44.9564884Z compiled=False, 2025-05-07T20:32:44.9564960Z ) 2025-05-07T20:32:44.9565171Z self = 2025-05-07T20:32:44.9565381Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9565393Z 2025-05-07T20:32:44.9565469Z @given( 2025-05-07T20:32:44.9565584Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9565682Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9565794Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9565907Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9566023Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9566100Z ) 2025-05-07T20:32:44.9566383Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9566475Z def test_silu_mul_quant( 2025-05-07T20:32:44.9566552Z self, 2025-05-07T20:32:44.9566628Z T: int, 2025-05-07T20:32:44.9566705Z D: int, 2025-05-07T20:32:44.9566801Z scale_ub: Optional[float], 2025-05-07T20:32:44.9566890Z contiguous: bool, 2025-05-07T20:32:44.9566975Z compiled: bool, 2025-05-07T20:32:44.9567053Z ) -> None: 2025-05-07T20:32:44.9567188Z torch.manual_seed(2025) 2025-05-07T20:32:44.9567262Z 2025-05-07T20:32:44.9567427Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9567502Z 2025-05-07T20:32:44.9567634Z > x_sign = torch.sign(x) 2025-05-07T20:32:44.9569397Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9569412Z 2025-05-07T20:32:44.9569527Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:44.9569535Z 2025-05-07T20:32:44.9569637Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9569856Z self=, 2025-05-07T20:32:44.9569933Z T=1, 2025-05-07T20:32:44.9570012Z D=7168, 2025-05-07T20:32:44.9570096Z scale_ub=1200.0, 2025-05-07T20:32:44.9570179Z contiguous=True, 2025-05-07T20:32:44.9570263Z compiled=False, 2025-05-07T20:32:44.9570335Z ) 2025-05-07T20:32:44.9570592Z self = 2025-05-07T20:32:44.9570759Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9570763Z 2025-05-07T20:32:44.9570840Z @given( 2025-05-07T20:32:44.9570956Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9571054Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9571164Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9571279Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9571395Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9571469Z ) 2025-05-07T20:32:44.9571711Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9571801Z def test_silu_mul_quant( 2025-05-07T20:32:44.9571876Z self, 2025-05-07T20:32:44.9571955Z T: int, 2025-05-07T20:32:44.9572031Z D: int, 2025-05-07T20:32:44.9572127Z scale_ub: Optional[float], 2025-05-07T20:32:44.9572220Z contiguous: bool, 2025-05-07T20:32:44.9572305Z compiled: bool, 2025-05-07T20:32:44.9572381Z ) -> None: 2025-05-07T20:32:44.9572477Z torch.manual_seed(2025) 2025-05-07T20:32:44.9572550Z 2025-05-07T20:32:44.9572715Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9572791Z 2025-05-07T20:32:44.9572881Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9573050Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9573142Z x = x_sign * x_clamp 2025-05-07T20:32:44.9573220Z x0 = x[:, :D] 2025-05-07T20:32:44.9573301Z x1 = x[:, D:] 2025-05-07T20:32:44.9573372Z 2025-05-07T20:32:44.9573454Z if contiguous: 2025-05-07T20:32:44.9573547Z x0 = x0.contiguous() 2025-05-07T20:32:44.9573635Z x1 = x1.contiguous() 2025-05-07T20:32:44.9573708Z 2025-05-07T20:32:44.9573802Z if scale_ub is not None: 2025-05-07T20:32:44.9573909Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9574083Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9574157Z ) 2025-05-07T20:32:44.9574233Z else: 2025-05-07T20:32:44.9574328Z scale_ub_tensor = None 2025-05-07T20:32:44.9574401Z 2025-05-07T20:32:44.9574529Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9574618Z op = silu_mul_quant 2025-05-07T20:32:44.9574706Z if compiled: 2025-05-07T20:32:44.9574844Z op = torch.compile(op) 2025-05-07T20:32:44.9574956Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9575027Z 2025-05-07T20:32:44.9575116Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9575121Z 2025-05-07T20:32:44.9575218Z moe/activation_test.py:117: 2025-05-07T20:32:44.9575345Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9575446Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9575549Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9576046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9576144Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9576498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9576719Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9577061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9577153Z kernel = self.compile( 2025-05-07T20:32:44.9577535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9577705Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9577872Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9577879Z 2025-05-07T20:32:44.9578084Z self = 2025-05-07T20:32:44.9578854Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9579357Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e230680>} 2025-05-07T20:32:44.9580099Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9580288Z context = 2025-05-07T20:32:44.9580299Z 2025-05-07T20:32:44.9580463Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9580722Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9580830Z module_map=module_map) 2025-05-07T20:32:44.9580988Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9581085Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9581231Z E ^ 2025-05-07T20:32:44.9581586Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9581591Z 2025-05-07T20:32:44.9582002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9582007Z 2025-05-07T20:32:44.9582108Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9582326Z self=, 2025-05-07T20:32:44.9582470Z T=128, 2025-05-07T20:32:44.9582545Z D=5120, 2025-05-07T20:32:44.9582626Z scale_ub=None, 2025-05-07T20:32:44.9582713Z contiguous=True, 2025-05-07T20:32:44.9582795Z compiled=False, 2025-05-07T20:32:44.9582867Z ) 2025-05-07T20:32:44.9583086Z self = 2025-05-07T20:32:44.9583253Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9583260Z 2025-05-07T20:32:44.9583381Z @given( 2025-05-07T20:32:44.9583503Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9583601Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9583715Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9583831Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9583940Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9584020Z ) 2025-05-07T20:32:44.9584263Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9584357Z def test_silu_mul_quant( 2025-05-07T20:32:44.9584435Z self, 2025-05-07T20:32:44.9584511Z T: int, 2025-05-07T20:32:44.9584586Z D: int, 2025-05-07T20:32:44.9584683Z scale_ub: Optional[float], 2025-05-07T20:32:44.9584770Z contiguous: bool, 2025-05-07T20:32:44.9584855Z compiled: bool, 2025-05-07T20:32:44.9584933Z ) -> None: 2025-05-07T20:32:44.9585030Z torch.manual_seed(2025) 2025-05-07T20:32:44.9585105Z 2025-05-07T20:32:44.9585270Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9585345Z 2025-05-07T20:32:44.9585439Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9585562Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9585651Z x = x_sign * x_clamp 2025-05-07T20:32:44.9585733Z x0 = x[:, :D] 2025-05-07T20:32:44.9585858Z x1 = x[:, D:] 2025-05-07T20:32:44.9585932Z 2025-05-07T20:32:44.9586021Z if contiguous: 2025-05-07T20:32:44.9586110Z x0 = x0.contiguous() 2025-05-07T20:32:44.9586201Z x1 = x1.contiguous() 2025-05-07T20:32:44.9586273Z 2025-05-07T20:32:44.9586362Z if scale_ub is not None: 2025-05-07T20:32:44.9586468Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9586603Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9586680Z ) 2025-05-07T20:32:44.9586763Z else: 2025-05-07T20:32:44.9586856Z scale_ub_tensor = None 2025-05-07T20:32:44.9586929Z 2025-05-07T20:32:44.9587063Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9587153Z op = silu_mul_quant 2025-05-07T20:32:44.9587236Z if compiled: 2025-05-07T20:32:44.9587338Z op = torch.compile(op) 2025-05-07T20:32:44.9587442Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9587522Z 2025-05-07T20:32:44.9587611Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9587618Z 2025-05-07T20:32:44.9587712Z moe/activation_test.py:117: 2025-05-07T20:32:44.9587842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9587945Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9588044Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9588586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9588683Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9589039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9589256Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9589591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9589692Z kernel = self.compile( 2025-05-07T20:32:44.9590107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9590277Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9590404Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9590408Z 2025-05-07T20:32:44.9590610Z self = 2025-05-07T20:32:44.9591424Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9591925Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e2318a0>} 2025-05-07T20:32:44.9592671Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9592862Z context = 2025-05-07T20:32:44.9592866Z 2025-05-07T20:32:44.9593028Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9593293Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9593398Z module_map=module_map) 2025-05-07T20:32:44.9593558Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9593655Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9593732Z E ^ 2025-05-07T20:32:44.9594083Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9594130Z 2025-05-07T20:32:44.9594542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9594547Z 2025-05-07T20:32:44.9594649Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9594868Z self=, 2025-05-07T20:32:44.9594946Z T=128, 2025-05-07T20:32:44.9595028Z D=7168, 2025-05-07T20:32:44.9595108Z scale_ub=None, 2025-05-07T20:32:44.9595194Z contiguous=True, 2025-05-07T20:32:44.9595279Z compiled=False, 2025-05-07T20:32:44.9595350Z ) 2025-05-07T20:32:44.9595562Z self = 2025-05-07T20:32:44.9595733Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9595737Z 2025-05-07T20:32:44.9595814Z @given( 2025-05-07T20:32:44.9595932Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9596035Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9596150Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9596268Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9596378Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9596452Z ) 2025-05-07T20:32:44.9596695Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9596788Z def test_silu_mul_quant( 2025-05-07T20:32:44.9596909Z self, 2025-05-07T20:32:44.9596993Z T: int, 2025-05-07T20:32:44.9597072Z D: int, 2025-05-07T20:32:44.9597170Z scale_ub: Optional[float], 2025-05-07T20:32:44.9597260Z contiguous: bool, 2025-05-07T20:32:44.9597344Z compiled: bool, 2025-05-07T20:32:44.9597421Z ) -> None: 2025-05-07T20:32:44.9597516Z torch.manual_seed(2025) 2025-05-07T20:32:44.9597588Z 2025-05-07T20:32:44.9597756Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9597831Z 2025-05-07T20:32:44.9597962Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9598089Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9598176Z x = x_sign * x_clamp 2025-05-07T20:32:44.9598255Z x0 = x[:, :D] 2025-05-07T20:32:44.9598337Z x1 = x[:, D:] 2025-05-07T20:32:44.9598407Z 2025-05-07T20:32:44.9598489Z if contiguous: 2025-05-07T20:32:44.9598582Z x0 = x0.contiguous() 2025-05-07T20:32:44.9598711Z x1 = x1.contiguous() 2025-05-07T20:32:44.9598784Z 2025-05-07T20:32:44.9598875Z if scale_ub is not None: 2025-05-07T20:32:44.9598979Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9599114Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9599191Z ) 2025-05-07T20:32:44.9599267Z else: 2025-05-07T20:32:44.9599364Z scale_ub_tensor = None 2025-05-07T20:32:44.9599440Z 2025-05-07T20:32:44.9599566Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9599662Z op = silu_mul_quant 2025-05-07T20:32:44.9599746Z if compiled: 2025-05-07T20:32:44.9599848Z op = torch.compile(op) 2025-05-07T20:32:44.9599951Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9600024Z 2025-05-07T20:32:44.9600116Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9600120Z 2025-05-07T20:32:44.9600215Z moe/activation_test.py:117: 2025-05-07T20:32:44.9600346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9600449Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9600547Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9601043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9601139Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9601539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9601762Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9602098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9602194Z kernel = self.compile( 2025-05-07T20:32:44.9602575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9602747Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9602875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9602880Z 2025-05-07T20:32:44.9603082Z self = 2025-05-07T20:32:44.9603851Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9604356Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e2327a0>} 2025-05-07T20:32:44.9605138Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9605330Z context = 2025-05-07T20:32:44.9605335Z 2025-05-07T20:32:44.9605495Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9605993Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9606101Z module_map=module_map) 2025-05-07T20:32:44.9606264Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9606434Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9606512Z E ^ 2025-05-07T20:32:44.9606862Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9606867Z 2025-05-07T20:32:44.9607282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9607286Z 2025-05-07T20:32:44.9607444Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9607750Z self=, 2025-05-07T20:32:44.9607829Z T=2048, 2025-05-07T20:32:44.9607905Z D=7168, 2025-05-07T20:32:44.9607989Z scale_ub=1200.0, 2025-05-07T20:32:44.9608075Z contiguous=True, 2025-05-07T20:32:44.9608158Z compiled=False, 2025-05-07T20:32:44.9608231Z ) 2025-05-07T20:32:44.9608449Z self = 2025-05-07T20:32:44.9608626Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9608630Z 2025-05-07T20:32:44.9608709Z @given( 2025-05-07T20:32:44.9608825Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9608924Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9609035Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9609151Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9609266Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9609339Z ) 2025-05-07T20:32:44.9609579Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9609674Z def test_silu_mul_quant( 2025-05-07T20:32:44.9609750Z self, 2025-05-07T20:32:44.9609829Z T: int, 2025-05-07T20:32:44.9609904Z D: int, 2025-05-07T20:32:44.9610094Z scale_ub: Optional[float], 2025-05-07T20:32:44.9610184Z contiguous: bool, 2025-05-07T20:32:44.9610273Z compiled: bool, 2025-05-07T20:32:44.9610350Z ) -> None: 2025-05-07T20:32:44.9610448Z torch.manual_seed(2025) 2025-05-07T20:32:44.9610521Z 2025-05-07T20:32:44.9610689Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9612473Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
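Note that the GPU is already nearly exhausted when each example begins (tens of MiB free out of 22.07 GiB), so allocations left over from earlier failed examples appear to be holding the cache. Because Hypothesis replays the decorated test function once per drawn example, a reset placed at the top of the test body, rather than in tearDown, runs before every (T, D, scale_ub, contiguous, compiled) combination. A sketch of such a helper follows; whether it is enough to avoid these OOMs is untested:

    import gc

    import torch

    def _reset_cuda_memory() -> None:
        # Drop dangling references from the previous example, then return the
        # cached blocks to the driver so the next example starts clean.
        gc.collect()
        torch.cuda.empty_cache()

    # Called as the first statement of test_silu_mul_quant, before the
    # torch.randn([T, 2 * D], ...) allocation.
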
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9612481Z 2025-05-07T20:32:44.9612597Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9612604Z 2025-05-07T20:32:44.9612706Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9612923Z self=, 2025-05-07T20:32:44.9613002Z T=1, 2025-05-07T20:32:44.9613075Z D=5120, 2025-05-07T20:32:44.9613157Z scale_ub=1200.0, 2025-05-07T20:32:44.9613245Z contiguous=True, 2025-05-07T20:32:44.9613328Z compiled=False, 2025-05-07T20:32:44.9613400Z ) 2025-05-07T20:32:44.9613680Z self = 2025-05-07T20:32:44.9613844Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9613848Z 2025-05-07T20:32:44.9613926Z @given( 2025-05-07T20:32:44.9614046Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9614142Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9614253Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9614375Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9614529Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9614604Z ) 2025-05-07T20:32:44.9614844Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9614935Z def test_silu_mul_quant( 2025-05-07T20:32:44.9615013Z self, 2025-05-07T20:32:44.9615090Z T: int, 2025-05-07T20:32:44.9615164Z D: int, 2025-05-07T20:32:44.9615304Z scale_ub: Optional[float], 2025-05-07T20:32:44.9615393Z contiguous: bool, 2025-05-07T20:32:44.9615480Z compiled: bool, 2025-05-07T20:32:44.9615561Z ) -> None: 2025-05-07T20:32:44.9615654Z torch.manual_seed(2025) 2025-05-07T20:32:44.9615730Z 2025-05-07T20:32:44.9615900Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9615974Z 2025-05-07T20:32:44.9616067Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9616192Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9616284Z x = x_sign * x_clamp 2025-05-07T20:32:44.9616366Z x0 = x[:, :D] 2025-05-07T20:32:44.9616445Z x1 = x[:, D:] 2025-05-07T20:32:44.9616519Z 2025-05-07T20:32:44.9616604Z if contiguous: 2025-05-07T20:32:44.9616693Z x0 = x0.contiguous() 2025-05-07T20:32:44.9616783Z x1 = x1.contiguous() 2025-05-07T20:32:44.9616855Z 2025-05-07T20:32:44.9616947Z if scale_ub is not None: 2025-05-07T20:32:44.9617053Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9617190Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9617266Z ) 2025-05-07T20:32:44.9617344Z else: 2025-05-07T20:32:44.9617436Z scale_ub_tensor = None 2025-05-07T20:32:44.9617508Z 2025-05-07T20:32:44.9617640Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9617729Z op = silu_mul_quant 2025-05-07T20:32:44.9617858Z if compiled: 2025-05-07T20:32:44.9617963Z op = torch.compile(op) 2025-05-07T20:32:44.9618067Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9618139Z 2025-05-07T20:32:44.9618231Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9618235Z 2025-05-07T20:32:44.9618332Z moe/activation_test.py:117: 2025-05-07T20:32:44.9618461Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9618563Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9618663Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9619163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9619259Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9619613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9619840Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9620178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9620271Z kernel = self.compile( 2025-05-07T20:32:44.9620648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9620862Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9620994Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9620998Z 2025-05-07T20:32:44.9621198Z self = 2025-05-07T20:32:44.9621969Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9622470Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5d3e233b00>} 2025-05-07T20:32:44.9623250Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9623479Z context = 2025-05-07T20:32:44.9623485Z 2025-05-07T20:32:44.9623647Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9623909Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9624015Z module_map=module_map) 2025-05-07T20:32:44.9624174Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9624278Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9624354Z E ^ 2025-05-07T20:32:44.9624712Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9624719Z 2025-05-07T20:32:44.9625127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9625131Z 2025-05-07T20:32:44.9625231Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9625455Z self=, 2025-05-07T20:32:44.9625531Z T=2048, 2025-05-07T20:32:44.9625608Z D=5120, 2025-05-07T20:32:44.9625691Z scale_ub=None, 2025-05-07T20:32:44.9625775Z contiguous=True, 2025-05-07T20:32:44.9625861Z compiled=False, 2025-05-07T20:32:44.9625938Z ) 2025-05-07T20:32:44.9626150Z self = 2025-05-07T20:32:44.9626323Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9626374Z 2025-05-07T20:32:44.9626450Z @given( 2025-05-07T20:32:44.9626566Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9626665Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9626777Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9626891Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9627006Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9627083Z ) 2025-05-07T20:32:44.9627327Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9627422Z def test_silu_mul_quant( 2025-05-07T20:32:44.9627497Z self, 2025-05-07T20:32:44.9627574Z T: int, 2025-05-07T20:32:44.9627650Z D: int, 2025-05-07T20:32:44.9627747Z scale_ub: Optional[float], 2025-05-07T20:32:44.9627836Z contiguous: bool, 2025-05-07T20:32:44.9627919Z compiled: bool, 2025-05-07T20:32:44.9627998Z ) -> None: 2025-05-07T20:32:44.9628097Z torch.manual_seed(2025) 2025-05-07T20:32:44.9628184Z 2025-05-07T20:32:44.9628375Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9628461Z 2025-05-07T20:32:44.9628552Z > x_sign = torch.sign(x) 2025-05-07T20:32:44.9630368Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9630374Z 2025-05-07T20:32:44.9630491Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:44.9630498Z 2025-05-07T20:32:44.9630642Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9630864Z self=, 2025-05-07T20:32:44.9634338Z T=16384, 2025-05-07T20:32:44.9634432Z D=5120, 2025-05-07T20:32:44.9634519Z scale_ub=None, 2025-05-07T20:32:44.9634606Z contiguous=True, 2025-05-07T20:32:44.9634695Z compiled=False, 2025-05-07T20:32:44.9634771Z ) 2025-05-07T20:32:44.9635063Z self = 2025-05-07T20:32:44.9635247Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9635251Z 2025-05-07T20:32:44.9635329Z @given( 2025-05-07T20:32:44.9635452Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9635551Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9635666Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9635796Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9635914Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9635989Z ) 2025-05-07T20:32:44.9636239Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9636334Z def test_silu_mul_quant( 2025-05-07T20:32:44.9636415Z self, 2025-05-07T20:32:44.9636494Z T: int, 2025-05-07T20:32:44.9636572Z D: int, 2025-05-07T20:32:44.9636677Z scale_ub: Optional[float], 2025-05-07T20:32:44.9636770Z contiguous: bool, 2025-05-07T20:32:44.9636857Z compiled: bool, 2025-05-07T20:32:44.9636941Z ) -> None: 2025-05-07T20:32:44.9637036Z torch.manual_seed(2025) 2025-05-07T20:32:44.9637110Z 2025-05-07T20:32:44.9637281Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9639118Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9639176Z 2025-05-07T20:32:44.9639303Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9639308Z 2025-05-07T20:32:44.9639412Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9639636Z self=, 2025-05-07T20:32:44.9639714Z T=4096, 2025-05-07T20:32:44.9639792Z D=5120, 2025-05-07T20:32:44.9639881Z scale_ub=None, 2025-05-07T20:32:44.9639967Z contiguous=True, 2025-05-07T20:32:44.9640051Z compiled=False, 2025-05-07T20:32:44.9640130Z ) 2025-05-07T20:32:44.9640347Z self = 2025-05-07T20:32:44.9640520Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9640524Z 2025-05-07T20:32:44.9640604Z @given( 2025-05-07T20:32:44.9640723Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9640822Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9640940Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9641104Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9641223Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9641297Z ) 2025-05-07T20:32:44.9641541Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9641639Z def test_silu_mul_quant( 2025-05-07T20:32:44.9641716Z self, 2025-05-07T20:32:44.9641793Z T: int, 2025-05-07T20:32:44.9641879Z D: int, 2025-05-07T20:32:44.9641979Z scale_ub: Optional[float], 2025-05-07T20:32:44.9642111Z contiguous: bool, 2025-05-07T20:32:44.9642200Z compiled: bool, 2025-05-07T20:32:44.9642279Z ) -> None: 2025-05-07T20:32:44.9642374Z torch.manual_seed(2025) 2025-05-07T20:32:44.9642452Z 2025-05-07T20:32:44.9642619Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9644451Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9644462Z 2025-05-07T20:32:44.9644585Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9644592Z 2025-05-07T20:32:44.9644695Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9644917Z self=, 2025-05-07T20:32:44.9644998Z T=2048, 2025-05-07T20:32:44.9645075Z D=5120, 2025-05-07T20:32:44.9645158Z scale_ub=None, 2025-05-07T20:32:44.9645247Z contiguous=False, 2025-05-07T20:32:44.9645334Z compiled=False, 2025-05-07T20:32:44.9645408Z ) 2025-05-07T20:32:44.9645626Z self = 2025-05-07T20:32:44.9645798Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.9645802Z 2025-05-07T20:32:44.9645884Z @given( 2025-05-07T20:32:44.9646002Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9646101Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9646264Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9646385Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9646497Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9646575Z ) 2025-05-07T20:32:44.9646818Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9646913Z def test_silu_mul_quant( 2025-05-07T20:32:44.9646992Z self, 2025-05-07T20:32:44.9647071Z T: int, 2025-05-07T20:32:44.9647154Z D: int, 2025-05-07T20:32:44.9647255Z scale_ub: Optional[float], 2025-05-07T20:32:44.9647345Z contiguous: bool, 2025-05-07T20:32:44.9647434Z compiled: bool, 2025-05-07T20:32:44.9647514Z ) -> None: 2025-05-07T20:32:44.9647687Z torch.manual_seed(2025) 2025-05-07T20:32:44.9647764Z 2025-05-07T20:32:44.9647931Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9649743Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9649760Z 2025-05-07T20:32:44.9649879Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9649883Z 2025-05-07T20:32:44.9649986Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9650208Z self=, 2025-05-07T20:32:44.9650285Z T=4096, 2025-05-07T20:32:44.9650367Z D=7168, 2025-05-07T20:32:44.9650450Z scale_ub=None, 2025-05-07T20:32:44.9650539Z contiguous=True, 2025-05-07T20:32:44.9650627Z compiled=True, 2025-05-07T20:32:44.9650746Z ) 2025-05-07T20:32:44.9650961Z self = 2025-05-07T20:32:44.9651132Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.9651136Z 2025-05-07T20:32:44.9651214Z @given( 2025-05-07T20:32:44.9651332Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9651434Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9651589Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9651710Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9651823Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9651898Z ) 2025-05-07T20:32:44.9652144Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9652238Z def test_silu_mul_quant( 2025-05-07T20:32:44.9652315Z self, 2025-05-07T20:32:44.9652399Z T: int, 2025-05-07T20:32:44.9652476Z D: int, 2025-05-07T20:32:44.9652578Z scale_ub: Optional[float], 2025-05-07T20:32:44.9652670Z contiguous: bool, 2025-05-07T20:32:44.9652757Z compiled: bool, 2025-05-07T20:32:44.9652836Z ) -> None: 2025-05-07T20:32:44.9652937Z torch.manual_seed(2025) 2025-05-07T20:32:44.9653011Z 2025-05-07T20:32:44.9653184Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9654953Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9655004Z 2025-05-07T20:32:44.9655127Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9655132Z 2025-05-07T20:32:44.9655234Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9655454Z self=, 2025-05-07T20:32:44.9655535Z T=2048, 2025-05-07T20:32:44.9655612Z D=5120, 2025-05-07T20:32:44.9655696Z scale_ub=1200.0, 2025-05-07T20:32:44.9655788Z contiguous=False, 2025-05-07T20:32:44.9655875Z compiled=False, 2025-05-07T20:32:44.9655950Z ) 2025-05-07T20:32:44.9656166Z self = 2025-05-07T20:32:44.9656339Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.9656344Z 2025-05-07T20:32:44.9656424Z @given( 2025-05-07T20:32:44.9656542Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9656644Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9656763Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9656879Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9656992Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9657069Z ) 2025-05-07T20:32:44.9657310Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9657404Z def test_silu_mul_quant( 2025-05-07T20:32:44.9657527Z self, 2025-05-07T20:32:44.9657609Z T: int, 2025-05-07T20:32:44.9657689Z D: int, 2025-05-07T20:32:44.9657788Z scale_ub: Optional[float], 2025-05-07T20:32:44.9657877Z contiguous: bool, 2025-05-07T20:32:44.9657965Z compiled: bool, 2025-05-07T20:32:44.9658045Z ) -> None: 2025-05-07T20:32:44.9658142Z torch.manual_seed(2025) 2025-05-07T20:32:44.9658218Z 2025-05-07T20:32:44.9658385Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9660193Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9660235Z 2025-05-07T20:32:44.9660355Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9660360Z 2025-05-07T20:32:44.9660463Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9660685Z self=, 2025-05-07T20:32:44.9660764Z T=4096, 2025-05-07T20:32:44.9660845Z D=7168, 2025-05-07T20:32:44.9660932Z scale_ub=1200.0, 2025-05-07T20:32:44.9661021Z contiguous=True, 2025-05-07T20:32:44.9661108Z compiled=False, 2025-05-07T20:32:44.9661184Z ) 2025-05-07T20:32:44.9661397Z self = 2025-05-07T20:32:44.9661569Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9661574Z 2025-05-07T20:32:44.9661653Z @given( 2025-05-07T20:32:44.9661774Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9661878Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9661992Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9662113Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9662226Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9662301Z ) 2025-05-07T20:32:44.9662545Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9662684Z def test_silu_mul_quant( 2025-05-07T20:32:44.9662761Z self, 2025-05-07T20:32:44.9662844Z T: int, 2025-05-07T20:32:44.9662921Z D: int, 2025-05-07T20:32:44.9663020Z scale_ub: Optional[float], 2025-05-07T20:32:44.9663114Z contiguous: bool, 2025-05-07T20:32:44.9663200Z compiled: bool, 2025-05-07T20:32:44.9663279Z ) -> None: 2025-05-07T20:32:44.9663376Z torch.manual_seed(2025) 2025-05-07T20:32:44.9663452Z 2025-05-07T20:32:44.9663627Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9665390Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9665401Z 2025-05-07T20:32:44.9665522Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9665527Z 2025-05-07T20:32:44.9665630Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9665850Z self=, 2025-05-07T20:32:44.9665932Z T=16384, 2025-05-07T20:32:44.9666054Z D=7168, 2025-05-07T20:32:44.9666142Z scale_ub=None, 2025-05-07T20:32:44.9666232Z contiguous=False, 2025-05-07T20:32:44.9666316Z compiled=True, 2025-05-07T20:32:44.9666390Z ) 2025-05-07T20:32:44.9666611Z self = 2025-05-07T20:32:44.9666785Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.9666790Z 2025-05-07T20:32:44.9666871Z @given( 2025-05-07T20:32:44.9666992Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9667131Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9667248Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9667364Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9667476Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9667554Z ) 2025-05-07T20:32:44.9667796Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9667931Z def test_silu_mul_quant( 2025-05-07T20:32:44.9668014Z self, 2025-05-07T20:32:44.9668092Z T: int, 2025-05-07T20:32:44.9668191Z D: int, 2025-05-07T20:32:44.9668298Z scale_ub: Optional[float], 2025-05-07T20:32:44.9668410Z contiguous: bool, 2025-05-07T20:32:44.9668499Z compiled: bool, 2025-05-07T20:32:44.9668578Z ) -> None: 2025-05-07T20:32:44.9668672Z torch.manual_seed(2025) 2025-05-07T20:32:44.9668752Z 2025-05-07T20:32:44.9668918Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9670698Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9670704Z 2025-05-07T20:32:44.9670822Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9670826Z 2025-05-07T20:32:44.9670928Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9671150Z self=, 2025-05-07T20:32:44.9671271Z T=4096, 2025-05-07T20:32:44.9671352Z D=7168, 2025-05-07T20:32:44.9671437Z scale_ub=None, 2025-05-07T20:32:44.9671523Z contiguous=True, 2025-05-07T20:32:44.9671614Z compiled=False, 2025-05-07T20:32:44.9671689Z ) 2025-05-07T20:32:44.9671903Z self = 2025-05-07T20:32:44.9672075Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9672080Z 2025-05-07T20:32:44.9672157Z @given( 2025-05-07T20:32:44.9672282Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9672383Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9672497Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9672618Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9672730Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9672805Z ) 2025-05-07T20:32:44.9673049Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9673149Z def test_silu_mul_quant( 2025-05-07T20:32:44.9673226Z self, 2025-05-07T20:32:44.9673307Z T: int, 2025-05-07T20:32:44.9673385Z D: int, 2025-05-07T20:32:44.9673483Z scale_ub: Optional[float], 2025-05-07T20:32:44.9673575Z contiguous: bool, 2025-05-07T20:32:44.9673662Z compiled: bool, 2025-05-07T20:32:44.9673741Z ) -> None: 2025-05-07T20:32:44.9673839Z torch.manual_seed(2025) 2025-05-07T20:32:44.9673958Z 2025-05-07T20:32:44.9674130Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9675896Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9675965Z 2025-05-07T20:32:44.9676086Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9676091Z 2025-05-07T20:32:44.9676193Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9676417Z self=, 2025-05-07T20:32:44.9676534Z T=16384, 2025-05-07T20:32:44.9676616Z D=7168, 2025-05-07T20:32:44.9676700Z scale_ub=None, 2025-05-07T20:32:44.9676793Z contiguous=True, 2025-05-07T20:32:44.9676877Z compiled=False, 2025-05-07T20:32:44.9676951Z ) 2025-05-07T20:32:44.9677166Z self = 2025-05-07T20:32:44.9677338Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9677346Z 2025-05-07T20:32:44.9677429Z @given( 2025-05-07T20:32:44.9677549Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9677651Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9677766Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9677883Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9677999Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9678073Z ) 2025-05-07T20:32:44.9678342Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9678449Z def test_silu_mul_quant( 2025-05-07T20:32:44.9678545Z self, 2025-05-07T20:32:44.9678623Z T: int, 2025-05-07T20:32:44.9678705Z D: int, 2025-05-07T20:32:44.9678808Z scale_ub: Optional[float], 2025-05-07T20:32:44.9678900Z contiguous: bool, 2025-05-07T20:32:44.9678987Z compiled: bool, 2025-05-07T20:32:44.9679066Z ) -> None: 2025-05-07T20:32:44.9679210Z torch.manual_seed(2025) 2025-05-07T20:32:44.9679287Z 2025-05-07T20:32:44.9679452Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9681223Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
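Free memory is pinned at 26.44 MiB across consecutive examples, which points at memory retained from earlier examples in the same process rather than at any one allocation. A hypothetical per-example teardown (the helper name is illustrative) that returns whatever is no longer referenced:

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Drop unreferenced Python objects first so their CUDA blocks become
        # cache-resident, then hand the cached blocks back to the driver.
        gc.collect()
        torch.cuda.synchronize()
        torch.cuda.empty_cache()

The limitation: empty_cache() can only release blocks that are cached but unused; the 21.73 GiB still referenced by live tensors stays allocated regardless.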
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9681229Z 2025-05-07T20:32:44.9681347Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9681351Z 2025-05-07T20:32:44.9681457Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9681681Z self=, 2025-05-07T20:32:44.9681762Z T=16384, 2025-05-07T20:32:44.9681843Z D=7168, 2025-05-07T20:32:44.9681927Z scale_ub=1200.0, 2025-05-07T20:32:44.9682014Z contiguous=True, 2025-05-07T20:32:44.9682099Z compiled=False, 2025-05-07T20:32:44.9682173Z ) 2025-05-07T20:32:44.9682388Z self = 2025-05-07T20:32:44.9682607Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9682614Z 2025-05-07T20:32:44.9682694Z @given( 2025-05-07T20:32:44.9682814Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9682913Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9683027Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9683146Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9683259Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9683340Z ) 2025-05-07T20:32:44.9683582Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9683717Z def test_silu_mul_quant( 2025-05-07T20:32:44.9683796Z self, 2025-05-07T20:32:44.9683874Z T: int, 2025-05-07T20:32:44.9683951Z D: int, 2025-05-07T20:32:44.9684052Z scale_ub: Optional[float], 2025-05-07T20:32:44.9684141Z contiguous: bool, 2025-05-07T20:32:44.9684228Z compiled: bool, 2025-05-07T20:32:44.9684314Z ) -> None: 2025-05-07T20:32:44.9684447Z torch.manual_seed(2025) 2025-05-07T20:32:44.9684526Z 2025-05-07T20:32:44.9684693Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9686459Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9686470Z 2025-05-07T20:32:44.9686591Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9686596Z 2025-05-07T20:32:44.9686698Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9686924Z self=, 2025-05-07T20:32:44.9687003Z T=128, 2025-05-07T20:32:44.9687081Z D=5120, 2025-05-07T20:32:44.9687168Z scale_ub=1200.0, 2025-05-07T20:32:44.9687255Z contiguous=False, 2025-05-07T20:32:44.9687339Z compiled=False, 2025-05-07T20:32:44.9687416Z ) 2025-05-07T20:32:44.9687675Z self = 2025-05-07T20:32:44.9687892Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.9687901Z 2025-05-07T20:32:44.9687979Z @given( 2025-05-07T20:32:44.9688098Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9688199Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9688312Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9688429Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9688547Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9688625Z ) 2025-05-07T20:32:44.9688867Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9688964Z def test_silu_mul_quant( 2025-05-07T20:32:44.9689040Z self, 2025-05-07T20:32:44.9689117Z T: int, 2025-05-07T20:32:44.9689196Z D: int, 2025-05-07T20:32:44.9689295Z scale_ub: Optional[float], 2025-05-07T20:32:44.9689387Z contiguous: bool, 2025-05-07T20:32:44.9689476Z compiled: bool, 2025-05-07T20:32:44.9689555Z ) -> None: 2025-05-07T20:32:44.9689655Z torch.manual_seed(2025) 2025-05-07T20:32:44.9689730Z 2025-05-07T20:32:44.9689897Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9689974Z 2025-05-07T20:32:44.9690067Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9690192Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9690286Z x = x_sign * x_clamp 2025-05-07T20:32:44.9690413Z x0 = x[:, :D] 2025-05-07T20:32:44.9690499Z x1 = x[:, D:] 2025-05-07T20:32:44.9690577Z 2025-05-07T20:32:44.9690661Z if contiguous: 2025-05-07T20:32:44.9690760Z x0 = x0.contiguous() 2025-05-07T20:32:44.9690850Z x1 = x1.contiguous() 2025-05-07T20:32:44.9690923Z 2025-05-07T20:32:44.9691017Z if scale_ub is not None: 2025-05-07T20:32:44.9691124Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9691263Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9691384Z ) 2025-05-07T20:32:44.9691462Z else: 2025-05-07T20:32:44.9691557Z scale_ub_tensor = None 2025-05-07T20:32:44.9691634Z 2025-05-07T20:32:44.9691765Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9691857Z op = silu_mul_quant 2025-05-07T20:32:44.9691949Z if compiled: 2025-05-07T20:32:44.9692050Z op = torch.compile(op) 2025-05-07T20:32:44.9692199Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9692276Z 2025-05-07T20:32:44.9692369Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9692374Z 2025-05-07T20:32:44.9692474Z moe/activation_test.py:117: 2025-05-07T20:32:44.9692604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9692706Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9692809Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9693317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9693420Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:44.9693780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:44.9694001Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:44.9694345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:44.9694441Z kernel = self.compile(
2025-05-07T20:32:44.9694821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:44.9694998Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:44.9695125Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:44.9695171Z 
2025-05-07T20:32:44.9695376Z self = 
2025-05-07T20:32:44.9696152Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:44.9696658Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5999d5a700>}
2025-05-07T20:32:44.9697406Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:44.9697595Z context = 
2025-05-07T20:32:44.9697599Z 
2025-05-07T20:32:44.9697766Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:44.9698033Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:44.9698141Z module_map=module_map)
2025-05-07T20:32:44.9698306Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.9698405Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:44.9698486Z E ^
2025-05-07T20:32:44.9698887Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9698892Z 2025-05-07T20:32:44.9699304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9699309Z 2025-05-07T20:32:44.9699416Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9699636Z self=, 2025-05-07T20:32:44.9699719Z T=2048, 2025-05-07T20:32:44.9699800Z D=7168, 2025-05-07T20:32:44.9699884Z scale_ub=None, 2025-05-07T20:32:44.9700014Z contiguous=False, 2025-05-07T20:32:44.9700098Z compiled=False, 2025-05-07T20:32:44.9700172Z ) 2025-05-07T20:32:44.9700391Z self = 2025-05-07T20:32:44.9700562Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.9700567Z 2025-05-07T20:32:44.9700646Z @given( 2025-05-07T20:32:44.9700809Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9700910Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9701024Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9701143Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9701256Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9701333Z ) 2025-05-07T20:32:44.9701576Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9701673Z def test_silu_mul_quant( 2025-05-07T20:32:44.9701755Z self, 2025-05-07T20:32:44.9701832Z T: int, 2025-05-07T20:32:44.9701910Z D: int, 2025-05-07T20:32:44.9702011Z scale_ub: Optional[float], 2025-05-07T20:32:44.9702102Z contiguous: bool, 2025-05-07T20:32:44.9702189Z compiled: bool, 2025-05-07T20:32:44.9702271Z ) -> None: 2025-05-07T20:32:44.9702367Z torch.manual_seed(2025) 2025-05-07T20:32:44.9702441Z 2025-05-07T20:32:44.9702619Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9704388Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
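The CompilationError above is a hardware capability issue rather than a flaky failure: this job runs on linux.g5.4xlarge.nvidia.gpu, i.e. an NVIDIA A10G (compute capability 8.6), and Triton rejects the fp8e4nv (FP8 E4M3) dtype there, accepting only 'fp8e4b15' and 'fp8e5' per the error. A hypothetical guard, under the assumption that Triton's fp8e4nv lowering requires compute capability 8.9+ (Ada/Hopper), which is consistent with the A10G failing:

    import unittest

    import torch

    def cuda_supports_fp8_e4m3() -> bool:
        # Assumption: fp8e4nv compiles only on SM 8.9+ GPUs;
        # the A10G in this log is SM 8.6, hence the ValueError.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Applied to the test, e.g.:
    # @unittest.skipIf(not cuda_supports_fp8_e4m3(), "fp8e4nv needs SM 8.9+")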
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9704443Z 2025-05-07T20:32:44.9704561Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9704566Z 2025-05-07T20:32:44.9704669Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9704893Z self=, 2025-05-07T20:32:44.9704974Z T=128, 2025-05-07T20:32:44.9705054Z D=7168, 2025-05-07T20:32:44.9705142Z scale_ub=1200.0, 2025-05-07T20:32:44.9705228Z contiguous=True, 2025-05-07T20:32:44.9705313Z compiled=True, 2025-05-07T20:32:44.9705390Z ) 2025-05-07T20:32:44.9705880Z self = 2025-05-07T20:32:44.9706053Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9706060Z 2025-05-07T20:32:44.9706176Z @given( 2025-05-07T20:32:44.9706340Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9706487Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9706639Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9706758Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9706873Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9706948Z ) 2025-05-07T20:32:44.9707317Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9707421Z def test_silu_mul_quant( 2025-05-07T20:32:44.9707502Z self, 2025-05-07T20:32:44.9707582Z T: int, 2025-05-07T20:32:44.9707661Z D: int, 2025-05-07T20:32:44.9707760Z scale_ub: Optional[float], 2025-05-07T20:32:44.9707853Z contiguous: bool, 2025-05-07T20:32:44.9707942Z compiled: bool, 2025-05-07T20:32:44.9708022Z ) -> None: 2025-05-07T20:32:44.9708121Z torch.manual_seed(2025) 2025-05-07T20:32:44.9708199Z 2025-05-07T20:32:44.9708430Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9708509Z 2025-05-07T20:32:44.9708602Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9708730Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9708823Z x = x_sign * x_clamp 2025-05-07T20:32:44.9708905Z x0 = x[:, :D] 2025-05-07T20:32:44.9708988Z x1 = x[:, D:] 2025-05-07T20:32:44.9709065Z 2025-05-07T20:32:44.9709154Z if contiguous: 2025-05-07T20:32:44.9709309Z x0 = x0.contiguous() 2025-05-07T20:32:44.9709402Z x1 = x1.contiguous() 2025-05-07T20:32:44.9709477Z 2025-05-07T20:32:44.9709573Z if scale_ub is not None: 2025-05-07T20:32:44.9709680Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9709817Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9709898Z ) 2025-05-07T20:32:44.9709979Z else: 2025-05-07T20:32:44.9710074Z scale_ub_tensor = None 2025-05-07T20:32:44.9710153Z 2025-05-07T20:32:44.9710283Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9710377Z op = silu_mul_quant 2025-05-07T20:32:44.9710468Z if compiled: 2025-05-07T20:32:44.9710570Z op = torch.compile(op) 2025-05-07T20:32:44.9710679Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9710755Z 2025-05-07T20:32:44.9710850Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9710857Z 2025-05-07T20:32:44.9710961Z moe/activation_test.py:117: 2025-05-07T20:32:44.9711092Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9711194Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9711297Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9711667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9711829Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9712325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:44.9712426Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:44.9712784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:44.9713009Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:44.9713352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:44.9713453Z kernel = self.compile(
2025-05-07T20:32:44.9713832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:44.9714013Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:44.9714142Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:44.9714151Z 
2025-05-07T20:32:44.9714355Z self = 
2025-05-07T20:32:44.9715132Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:44.9715676Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5999d5bf60>}
2025-05-07T20:32:44.9716427Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:44.9716615Z context = 
2025-05-07T20:32:44.9716623Z 
2025-05-07T20:32:44.9716792Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:44.9717098Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:44.9717210Z module_map=module_map)
2025-05-07T20:32:44.9717376Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.9717477Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:44.9717557Z E ^
2025-05-07T20:32:44.9717953Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9717958Z 2025-05-07T20:32:44.9718396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9718400Z 2025-05-07T20:32:44.9718533Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9718753Z self=, 2025-05-07T20:32:44.9718836Z T=128, 2025-05-07T20:32:44.9718921Z D=7168, 2025-05-07T20:32:44.9719007Z scale_ub=1200.0, 2025-05-07T20:32:44.9719094Z contiguous=True, 2025-05-07T20:32:44.9719185Z compiled=False, 2025-05-07T20:32:44.9719262Z ) 2025-05-07T20:32:44.9719477Z self = 2025-05-07T20:32:44.9719651Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9719655Z 2025-05-07T20:32:44.9719736Z @given( 2025-05-07T20:32:44.9719860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9719962Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9720077Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9720197Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9720311Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9720388Z ) 2025-05-07T20:32:44.9720681Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9720779Z def test_silu_mul_quant( 2025-05-07T20:32:44.9720856Z self, 2025-05-07T20:32:44.9720938Z T: int, 2025-05-07T20:32:44.9721017Z D: int, 2025-05-07T20:32:44.9721115Z scale_ub: Optional[float], 2025-05-07T20:32:44.9721208Z contiguous: bool, 2025-05-07T20:32:44.9721295Z compiled: bool, 2025-05-07T20:32:44.9721377Z ) -> None: 2025-05-07T20:32:44.9721474Z torch.manual_seed(2025) 2025-05-07T20:32:44.9721551Z 2025-05-07T20:32:44.9721725Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9721799Z 2025-05-07T20:32:44.9721891Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9722022Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9723790Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
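By this example only 4.44 MiB of the 22.07 GiB is free, so even a 20 MiB request fails. One hypothetical way to let the property test reject such examples instead of erroring (the helper name is illustrative; torch.cuda.mem_get_info() returns free and total bytes for the current device):

    import torch
    from hypothesis import assume

    def require_free_cuda_mib(required_mib: float) -> None:
        free_bytes, _total_bytes = torch.cuda.mem_get_info()
        # assume() makes Hypothesis discard the example rather than fail it.
        assume(free_bytes >= required_mib * 2**20)

This masks the retention problem rather than fixing it, but it separates "example too big for this GPU right now" from real regressions.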
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9723802Z 2025-05-07T20:32:44.9723965Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.9723972Z 2025-05-07T20:32:44.9724077Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9724300Z self=, 2025-05-07T20:32:44.9724385Z T=128, 2025-05-07T20:32:44.9724462Z D=5120, 2025-05-07T20:32:44.9724550Z scale_ub=1200.0, 2025-05-07T20:32:44.9724637Z contiguous=True, 2025-05-07T20:32:44.9724720Z compiled=True, 2025-05-07T20:32:44.9724798Z ) 2025-05-07T20:32:44.9725018Z self = 2025-05-07T20:32:44.9725226Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9725230Z 2025-05-07T20:32:44.9725313Z @given( 2025-05-07T20:32:44.9725431Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9725529Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9725647Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9725805Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9725925Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9726000Z ) 2025-05-07T20:32:44.9726243Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9726340Z def test_silu_mul_quant( 2025-05-07T20:32:44.9726417Z self, 2025-05-07T20:32:44.9726495Z T: int, 2025-05-07T20:32:44.9726574Z D: int, 2025-05-07T20:32:44.9726675Z scale_ub: Optional[float], 2025-05-07T20:32:44.9726768Z contiguous: bool, 2025-05-07T20:32:44.9726859Z compiled: bool, 2025-05-07T20:32:44.9726940Z ) -> None: 2025-05-07T20:32:44.9727035Z torch.manual_seed(2025) 2025-05-07T20:32:44.9727113Z 2025-05-07T20:32:44.9727281Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9727359Z 2025-05-07T20:32:44.9727452Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9727651Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9729414Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
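Unlike the earlier failures, this one gets past torch.randn and dies on the clamp at activation_test.py:95: the preprocessing allocates three temporaries (the sign, abs, and clamp results) on top of x. A lower-peak-memory rewrite of that step is possible with in-place ops; a sketch, not the test's actual code:

    import torch

    def clamp_signed_inplace(x: torch.Tensor) -> torch.Tensor:
        # Equivalent to torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0),
        # but abs/clamp/mul reuse x's storage (mutating the caller's tensor),
        # so only the sign tensor is newly allocated instead of three temporaries.
        x_sign = torch.sign(x)
        return x.abs_().clamp_(0.01, 2.0).mul_(x_sign)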
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9729468Z 2025-05-07T20:32:44.9729590Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.9729595Z 2025-05-07T20:32:44.9729697Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9729917Z self=, 2025-05-07T20:32:44.9730003Z T=128, 2025-05-07T20:32:44.9730080Z D=7168, 2025-05-07T20:32:44.9730168Z scale_ub=None, 2025-05-07T20:32:44.9730257Z contiguous=True, 2025-05-07T20:32:44.9730340Z compiled=True, 2025-05-07T20:32:44.9730414Z ) 2025-05-07T20:32:44.9730631Z self = 2025-05-07T20:32:44.9730797Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.9730802Z 2025-05-07T20:32:44.9730882Z @given( 2025-05-07T20:32:44.9731003Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9731106Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9731224Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9731340Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9731453Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9731532Z ) 2025-05-07T20:32:44.9731776Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9731918Z def test_silu_mul_quant( 2025-05-07T20:32:44.9732000Z self, 2025-05-07T20:32:44.9732079Z T: int, 2025-05-07T20:32:44.9732160Z D: int, 2025-05-07T20:32:44.9732258Z scale_ub: Optional[float], 2025-05-07T20:32:44.9732348Z contiguous: bool, 2025-05-07T20:32:44.9732436Z compiled: bool, 2025-05-07T20:32:44.9732516Z ) -> None: 2025-05-07T20:32:44.9732610Z torch.manual_seed(2025) 2025-05-07T20:32:44.9732688Z 2025-05-07T20:32:44.9732858Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9734703Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9734709Z 2025-05-07T20:32:44.9734829Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9734965Z =============================== warnings summary =============================== 2025-05-07T20:32:44.9735277Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:44.9735581Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:44.9735883Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:44.9736761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:44.9736988Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:44.9736996Z 2025-05-07T20:32:44.9737203Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:44.9737366Z ================= 1 failed, 1 deselected, 3 warnings in 13.74s ================= 2025-05-07T20:32:46.5363991Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:46.5991720Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:32:46.5992181Z 2025-05-07T20:32:46.5992518Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:32:46.5993629Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:32:46.5994412Z 2025-05-07T20:32:46.5994431Z 2025-05-07T20:32:46.5994447Z 2025-05-07T20:32:46.6009962Z ##[error]Process completed with exit code 1. 2025-05-07T20:32:46.6097080Z Post job cleanup. 2025-05-07T20:32:46.7079923Z [command]/usr/bin/git version 2025-05-07T20:32:46.7120253Z git version 2.47.1 2025-05-07T20:32:46.7156137Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/a3f6b24d-5dfa-43e1-a56f-95622c529176/.gitconfig' 2025-05-07T20:32:46.7166963Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/a3f6b24d-5dfa-43e1-a56f-95622c529176' before making global git config changes 2025-05-07T20:32:46.7167880Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:32:46.7173251Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:32:46.7218972Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:32:46.7253635Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:32:46.7587916Z Entering 'external/asmjit' 2025-05-07T20:32:46.7654679Z Entering 'external/composable_kernel' 2025-05-07T20:32:46.7728623Z Entering 'external/cpuinfo' 2025-05-07T20:32:46.7795217Z Entering 'external/cutlass' 2025-05-07T20:32:46.7871101Z Entering 'external/googletest' 2025-05-07T20:32:46.7937559Z Entering 'external/hipify_torch' 2025-05-07T20:32:46.8003263Z Entering 'external/json' 2025-05-07T20:32:46.8089101Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:32:46.8114389Z http.https://github.com/.extraheader 2025-05-07T20:32:46.8126062Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:32:46.8159333Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:32:46.8492407Z Entering 'external/asmjit' 2025-05-07T20:32:46.8536851Z http.https://github.com/.extraheader 2025-05-07T20:32:46.8579039Z Entering 'external/composable_kernel' 2025-05-07T20:32:46.8622839Z http.https://github.com/.extraheader 2025-05-07T20:32:46.8671720Z Entering 'external/cpuinfo' 2025-05-07T20:32:46.8714877Z http.https://github.com/.extraheader 2025-05-07T20:32:46.8757129Z Entering 'external/cutlass' 2025-05-07T20:32:46.8799135Z http.https://github.com/.extraheader 2025-05-07T20:32:46.8849807Z 
Entering 'external/googletest' 2025-05-07T20:32:46.8893612Z http.https://github.com/.extraheader 2025-05-07T20:32:46.8935836Z Entering 'external/hipify_torch' 2025-05-07T20:32:46.8978217Z http.https://github.com/.extraheader 2025-05-07T20:32:46.9020106Z Entering 'external/json' 2025-05-07T20:32:46.9062589Z http.https://github.com/.extraheader 2025-05-07T20:32:46.9243061Z A job completed hook has been configured by the self-hosted runner administrator 2025-05-07T20:32:46.9278177Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-05-07T20:32:46.9288591Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:32:46.9288948Z ##[endgroup] 2025-05-07T20:32:46.9391975Z [!ALERT!] Swap in detected! [!ALERT!] 2025-05-07T20:32:57.7270277Z [!ALERT!] Swap out detected [!ALERT!] 2025-05-07T20:33:14.1077698Z Cleaning up orphan processes
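For reproducing this outside the CI retry loop: the run above already used pytest --lf (re-run last failures only), and Hypothesis prints every failing parameter set, so a specific case can be pinned directly on the test. A sketch using hypothesis.example with the last failing example from this log:

    from hypothesis import example

    # Stacked next to the test's existing @given/@settings decorators;
    # explicit examples run before any generated ones:
    #
    # @example(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
    # def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled): ...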